ML Featurizer Package

Featurizer

class mlfeaturizer.core.featurizer.LogTransformFeaturizer(*args, **kwargs)[source]

Perform Log Transformation on column.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters: extra – extra parameters to copy to the new instance
Returns: copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters: extra – extra param values
Returns: merged param map
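
The precedence rule for extractParamMap can be illustrated with plain dictionaries. This is only a sketch of the merge ordering, not the library's implementation; the helper name merge_param_maps and the sample values are hypothetical.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param maps; later maps win on conflicts."""
    merged = dict(defaults)        # lowest priority: default values
    merged.update(user_supplied)   # user-supplied values override defaults
    merged.update(extra or {})     # extra values override everything else
    return merged


# Hypothetical sample values, chosen only to show the ordering.
defaults = {"logType": "natural"}
user_supplied = {"inputCol": "price", "logType": "log10"}
extra = {"logType": "log2"}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged["logType"])  # the value from `extra` wins
```

Without an extra map, the user-supplied value takes precedence over the default.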
getInputCol()

Gets the value of inputCol or its default value.

getLogType()[source]

Gets the value of logType or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.
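
The difference between hasDefault, isSet, isDefined, and getOrDefault is easiest to see when defaults and user-supplied values are stored separately. The class below is a hypothetical stand-in written for illustration; it is not the library's Params class, and its snake_case method names are assumptions.

```python
class TinyParams:
    """Hypothetical stand-in that keeps defaults and user-set values apart."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user = {}

    def set(self, name, value):
        self._user[name] = value

    def has_default(self, name):
        return name in self._defaults

    def is_set(self, name):
        # True only for values the user supplied explicitly.
        return name in self._user

    def is_defined(self, name):
        # True if the user set the value or a default exists.
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"param {name!r} has no user-supplied or default value")


params = TinyParams({"logType": "natural"})
params.set("inputCol", "price")
print(params.is_set("logType"), params.is_defined("logType"))
```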

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

logType = Param(parent='undefined', name='logType', doc="log type to be used. Options are 'natural' (natural log), 'log10' (log base 10), or 'log2' (log base 2).")
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut for write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)

Sets the value of inputCol.

setLogType(value)[source]

Sets the value of logType.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCol=None, outputCol=None, logType="natural")[source]

Sets params for this LogTransformFeaturizer.
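
As a rough illustration of what the three logType options compute, the sketch below applies each one to a plain Python list using the standard math module. It does not use Spark or this package; the function name log_transform and the choice to return None for non-positive inputs are assumptions made for the example.

```python
import math

# Map each documented logType option to a standard-library function.
LOG_FUNCTIONS = {
    "natural": math.log,
    "log10": math.log10,
    "log2": math.log2,
}


def log_transform(values, log_type="natural"):
    """Apply the chosen logarithm to each value.

    Non-positive inputs map to None (an assumption of this sketch).
    """
    log_fn = LOG_FUNCTIONS[log_type]
    return [log_fn(v) if v > 0 else None for v in values]


print(log_transform([1, 100], "log10"))
print(log_transform([8], "log2"))
```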

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
  • params – an optional param map that overrides embedded params.
Returns: transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class mlfeaturizer.core.featurizer.PowerTransformFeaturizer(*args, **kwargs)[source]

Perform Power Transformation on column.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters: extra – extra parameters to copy to the new instance
Returns: copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters: extra – extra param values
Returns: merged param map
getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getPowerType()[source]

Gets the value of powerType or its default value.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

powerType = Param(parent='undefined', name='powerType', doc='power type to be used. Any integer greater than 0. Default is power of 2')
classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut for write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)

Sets the value of inputCol.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCol=None, outputCol=None, powerType=2)[source]

Sets params for this PowerTransformFeaturizer.
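
The powerType constraint (an integer greater than 0, defaulting to 2) can be sketched in plain Python as follows. This is an illustration on a list rather than a Spark column; the function name power_transform and the ValueError on invalid input are assumptions, since the page does not say how the library validates the value.

```python
def power_transform(values, power_type=2):
    """Raise each value to a positive integer power (default: square)."""
    if not isinstance(power_type, int) or power_type <= 0:
        # Validation behavior is an assumption of this sketch.
        raise ValueError("power_type must be an integer greater than 0")
    return [v ** power_type for v in values]


print(power_transform([1, 2, 3]))
print(power_transform([2], power_type=3))
```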

setPowerType(value)[source]

Sets the value of powerType.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
  • params – an optional param map that overrides embedded params.
Returns: transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class mlfeaturizer.core.featurizer.MathFeaturizer(*args, **kwargs)[source]

Perform Math Function Transformation on column.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters: extra – extra parameters to copy to the new instance
Returns: copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters: extra – extra param values
Returns: merged param map
getInputCol()

Gets the value of inputCol or its default value.

getMathFunction()[source]

Gets the value of mathFunction or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

mathFunction = Param(parent='undefined', name='mathFunction', doc='math function to be used. Default is sqrt')
outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut for write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCol(value)

Sets the value of inputCol.

setMathFunction(value)[source]

Sets the value of mathFunction.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCol=None, outputCol=None, mathFunction="sqrt")[source]

Sets params for this MathFeaturizer.
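
Selecting a function by name, as the mathFunction param does, might look like the sketch below. Only "sqrt" is confirmed by this page as a valid value (it is the default); the other names in the lookup table, and the function name apply_math_function, are hypothetical and included only to show the dispatch pattern.

```python
import math

# Only "sqrt" is documented; "abs" and "exp" are hypothetical examples.
MATH_FUNCTIONS = {
    "sqrt": math.sqrt,
    "abs": abs,
    "exp": math.exp,
}


def apply_math_function(values, math_function="sqrt"):
    """Look up a function by name and apply it to every value."""
    fn = MATH_FUNCTIONS[math_function]
    return [fn(v) for v in values]


print(apply_math_function([4.0, 9.0]))
```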

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
  • params – an optional param map that overrides embedded params.
Returns: transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class mlfeaturizer.core.featurizer.DayOfWeekFeaturizer(*args, **kwargs)[source]

Convert date time to day of week.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters: extra – extra parameters to copy to the new instance
Returns: copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters: extra – extra param values
Returns: merged param map
format = Param(parent='undefined', name='format', doc='specify timestamp pattern. ')
getFormat()[source]

Gets the value of format or its default value.

getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getTimezone()[source]

Gets the value of timezone or its default value.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut for write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setFormat(value)[source]

Sets the value of format.

setInputCol(value)

Sets the value of inputCol.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCol=None, outputCol=None, format="yyyy-MM-dd", timezone="UTC")[source]

Sets params for this DayOfWeekFeaturizer.
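
To show what a day-of-week feature looks like for a value in the default "yyyy-MM-dd" format, the sketch below parses a date string with Python's datetime module. The Java-style pattern "yyyy-MM-dd" corresponds to "%Y-%m-%d" in Python. The integer encoding (Monday = 0 through Sunday = 6) is Python's convention and may differ from what this featurizer emits; the function name day_of_week is hypothetical.

```python
from datetime import datetime


def day_of_week(value, fmt="%Y-%m-%d"):
    """Parse a date string and return Python's weekday index (Monday = 0)."""
    return datetime.strptime(value, fmt).weekday()


print(day_of_week("2024-01-01"))  # 2024-01-01 was a Monday, so this prints 0
```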

setTimezone(value)[source]

Sets the value of timezone.

timezone = Param(parent='undefined', name='timezone', doc='specify timezone. ')
transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
  • params – an optional param map that overrides embedded params.
Returns: transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class mlfeaturizer.core.featurizer.HourOfDayFeaturizer(*args, **kwargs)[source]

Convert date time to hour of day.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters: extra – extra parameters to copy to the new instance
Returns: copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters: extra – extra param values
Returns: merged param map
format = Param(parent='undefined', name='format', doc='specify timestamp pattern. ')
getFormat()[source]

Gets the value of format or its default value.

getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getTimezone()[source]

Gets the value of timezone or its default value.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut for write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setFormat(value)[source]

Sets the value of format.

setInputCol(value)

Sets the value of inputCol.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCol=None, outputCol=None, format="yyyy-MM-dd HH:mm:ss", timezone="UTC")[source]

Sets params for this HourOfDayFeaturizer.
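
The hour extracted from a timestamp depends on the time zone it is viewed in. The sketch below illustrates this with fixed UTC offsets from the standard library. It assumes the input string is expressed in UTC and is shifted to a target offset before the hour is read; whether the timezone param of this featurizer describes the input zone or the target zone is not stated on this page. The function name hour_of_day and the utc_offset_hours argument are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def hour_of_day(value, fmt="%Y-%m-%d %H:%M:%S", utc_offset_hours=0):
    """Read a UTC timestamp string and return the hour at a fixed UTC offset."""
    parsed_utc = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    target_zone = timezone(timedelta(hours=utc_offset_hours))
    return parsed_utc.astimezone(target_zone).hour


print(hour_of_day("2024-03-15 23:30:00"))                      # hour in UTC
print(hour_of_day("2024-03-15 23:30:00", utc_offset_hours=2))  # hour at UTC+2
```

Python's "%Y-%m-%d %H:%M:%S" plays the role of the Java-style default pattern "yyyy-MM-dd HH:mm:ss".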

setTimezone(value)[source]

Sets the value of timezone.

timezone = Param(parent='undefined', name='timezone', doc='specify timezone. ')
transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
  • params – an optional param map that overrides embedded params.
Returns: transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class mlfeaturizer.core.featurizer.MonthOfYearFeaturizer(*args, **kwargs)[source]

Convert date time to month of year.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters: extra – extra parameters to copy to the new instance
Returns: copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters: extra – extra param values
Returns: merged param map
format = Param(parent='undefined', name='format', doc='specify timestamp pattern. ')
getFormat()[source]

Gets the value of format or its default value.

getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getTimezone()[source]

Gets the value of timezone or its default value.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut for write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setFormat(value)[source]

Sets the value of format.

setInputCol(value)

Sets the value of inputCol.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCol=None, outputCol=None, format="yyyy-MM-dd", timezone="UTC")[source]

Sets params for this MonthOfYearFeaturizer.
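
A month-of-year feature over a whole column can be sketched as below, including one possible way to treat values that do not match the format. Returning None for unparseable or missing values is an assumption of this example, not documented behavior of the featurizer, and the function name month_of_year is hypothetical.

```python
from datetime import datetime


def month_of_year(values, fmt="%Y-%m-%d"):
    """Return the month (1-12) for each date string, or None if parsing fails."""
    months = []
    for value in values:
        try:
            months.append(datetime.strptime(value, fmt).month)
        except (TypeError, ValueError):
            # TypeError covers None, ValueError covers malformed strings.
            months.append(None)
    return months


print(month_of_year(["2024-01-31", "2023-12-01", "not a date", None]))
```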

setTimezone(value)[source]

Sets the value of timezone.

timezone = Param(parent='undefined', name='timezone', doc='specify timezone. ')
transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
  • params – an optional param map that overrides embedded params.
Returns: transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class mlfeaturizer.core.featurizer.PartsOfDayFeaturizer(*args, **kwargs)[source]

Convert date time to parts of day.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters: extra – extra parameters to copy to the new instance
Returns: copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters: extra – extra param values
Returns: merged param map
format = Param(parent='undefined', name='format', doc='specify timestamp pattern. ')
getFormat()[source]

Gets the value of format or its default value.

getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

getTimezone()[source]

Gets the value of timezone or its default value.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut for write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setFormat(value)[source]

Sets the value of format.

setInputCol(value)

Sets the value of inputCol.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCol=None, outputCol=None, format="yyyy-MM-dd HH:mm:ss", timezone="UTC")[source]

Sets params for this PartsOfDayFeaturizer.
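
This page does not say which parts of the day the featurizer distinguishes or where the boundaries lie. The sketch below shows the general idea of bucketing the hour into labels. The four labels, the hour boundaries, and the function name part_of_day are all hypothetical choices made for this example.

```python
from datetime import datetime


def part_of_day(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Bucket a timestamp's hour into a label (boundaries are hypothetical)."""
    hour = datetime.strptime(value, fmt).hour
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


print(part_of_day("2024-03-15 08:15:00"))
```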

setTimezone(value)[source]

Sets the value of timezone.

timezone = Param(parent='undefined', name='timezone', doc='specify timezone. ')
transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
  • params – an optional param map that overrides embedded params.
Returns: transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class mlfeaturizer.core.featurizer.AdditionFeaturizer(*args, **kwargs)[source]

Add two numeric columns.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters: extra – extra parameters to copy to the new instance
Returns: copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters: extra – extra param values
Returns: merged param map
getInputCols()

Gets the value of inputCols or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut for write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCols(value)

Sets the value of inputCols.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCols=None, outputCol=None)[source]

Sets params for this AdditionFeaturizer.
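
The roles of inputCols and outputCol can be sketched with rows represented as plain dictionaries. This is an illustration of the column arithmetic only, not of the Spark-based implementation; the function name add_columns and the sample column names are hypothetical.

```python
def add_columns(rows, input_cols, output_col):
    """Return new rows with output_col holding the sum of the input columns."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        new_row[output_col] = sum(row[col] for col in input_cols)
        result.append(new_row)
    return result


rows = [{"a": 1, "b": 2}, {"a": 10, "b": -4}]
summed = add_columns(rows, ["a", "b"], "a_plus_b")
print(summed)
```

Copying each row mirrors the way transform returns a new dataset instead of modifying its input.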

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
  • params – an optional param map that overrides embedded params.
Returns: transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class mlfeaturizer.core.featurizer.SubtractionFeaturizer(*args, **kwargs)[source]

Subtract two numeric columns.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters: extra – extra parameters to copy to the new instance
Returns: copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters: extra – extra param values
Returns: merged param map
getInputCols()

Gets the value of inputCols or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut for write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCols(value)

Sets the value of inputCols.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCols=None, outputCol=None)[source]

Sets params for this SubtractionFeaturizer.
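
Unlike addition, subtraction depends on the order of the input columns. The sketch below assumes the second column is subtracted from the first; this page does not confirm that ordering, so check it against the actual output before relying on it. The function name subtract_columns and the sample column names are hypothetical.

```python
def subtract_columns(rows, input_cols, output_col):
    """Return new rows with output_col = first input column - second input column."""
    left, right = input_cols  # exactly two columns, order matters (assumed)
    return [{**row, output_col: row[left] - row[right]} for row in rows]


rows = [{"revenue": 10, "cost": 4}]
print(subtract_columns(rows, ["revenue", "cost"], "margin"))
print(subtract_columns(rows, ["cost", "revenue"], "margin"))
```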

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
  • params – an optional param map that overrides embedded params.
Returns: transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class mlfeaturizer.core.featurizer.MultiplicationFeaturizer(*args, **kwargs)[source]

Multiply two numeric columns.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.

Parameters: extra – extra parameters to copy to the new instance
Returns: copy of this instance
explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters: extra – extra param values
Returns: merged param map
getInputCols()

Gets the value of inputCols or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')
isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path, a shortcut for write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCols(value)

Sets the value of inputCols.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCols=None, outputCol=None)[source]

Sets params for this MultiplicationFeaturizer.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
  • params – an optional param map that overrides embedded params.
Returns: transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class mlfeaturizer.core.featurizer.DivisionFeaturizer(*args, **kwargs)[source]

Divide two numeric columns.
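
As a rough picture of the row-wise operation, here is a minimal pure-Python sketch. It makes several assumptions that the documentation does not confirm: the first entry of inputCols is treated as the numerator and the second as the denominator, a zero denominator produces None, and divide_columns is a hypothetical helper operating on a list of dicts rather than a DataFrame.

```python
def divide_columns(rows, input_cols, output_col):
    """Return new rows with output_col = numerator / denominator."""
    numerator_col, denominator_col = input_cols
    result = []
    for row in rows:
        denominator = row[denominator_col]
        # Assumption: a zero denominator yields None rather than raising.
        quotient = None if denominator == 0 else row[numerator_col] / denominator
        result.append({**row, output_col: quotient})
    return result


# Hypothetical data and column names.
rows = [{"clicks": 10, "views": 40}, {"clicks": 3, "views": 0}]
divided = divide_columns(rows, ["clicks", "views"], "ctr")
print(divided[0]["ctr"])  # 0.25
```
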

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters: extra – Extra parameters to copy to the new instance
Returns: Copy of this instance
explainParam(param)

Explains a single param, returning its name, doc, and any default and user-supplied values as a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with the extra values passed as input into a flat param map. If keys conflict, the later value wins, i.e. the ordering is: default param values < user-supplied values < extra.

Parameters: extra – extra param values
Returns: merged param map
getInputCols()

Gets the value of inputCols or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')
isDefined(param)

Checks whether a param is explicitly set by the user or has a default value.

isSet(param)

Checks whether a param is explicitly set by the user.

classmethod load(path)

Reads an ML instance from the input path; a shortcut for read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path; a shortcut for write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setInputCols(value)

Sets the value of inputCols.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCols=None, outputCol=None)[source]

Sets params for this DivisionFeaturizer.

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  • params – an optional param map that overrides embedded params.
Returns: transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.

class mlfeaturizer.core.featurizer.GroupByFeaturizer(*args, **kwargs)[source]

Perform Group By Transformation.

aggregateCol = Param(parent='undefined', name='aggregateCol', doc='aggregate column to be used. ')
aggregateType = Param(parent='undefined', name='aggregateType', doc='aggregate type to be used. Default is count')
copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters: extra – Extra parameters to copy to the new instance
Returns: Copy of this instance
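
The copy semantics described above (same uid, params extended by extra) can be sketched with a small stand-in. FeaturizerStub, its uid string, and the param values are hypothetical; the sketch also ignores the companion Java component entirely.

```python
import copy


class FeaturizerStub:
    """Hypothetical stand-in for a featurizer; holds a uid and a param dict."""

    def __init__(self, uid, params):
        self.uid = uid
        self.params = dict(params)

    def copy(self, extra=None):
        clone = copy.copy(self)  # shallow copy, so the uid is preserved
        # Give the clone its own param dict; values in `extra` win.
        clone.params = {**self.params, **(extra or {})}
        return clone


original = FeaturizerStub("GroupByFeaturizer_0001", {"inputCol": "city"})
clone = original.copy(extra={"outputCol": "city_count"})

print(clone.uid == original.uid)  # True
```

The original instance is left untouched: its param dict still has no outputCol entry after the copy.
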
explainParam(param)

Explains a single param, returning its name, doc, and any default and user-supplied values as a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, then merges them with the extra values passed as input into a flat param map. If keys conflict, the later value wins, i.e. the ordering is: default param values < user-supplied values < extra.

Parameters: extra – extra param values
Returns: merged param map
getAggregateCol()[source]

Gets the value of aggregateCol or its default value.

getAggregateType()[source]

Gets the value of aggregateType or its default value.

getInputCol()

Gets the value of inputCol or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets the value of outputCol or its default value.

getParam(paramName)

Gets a param by its name.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')
isDefined(param)

Checks whether a param is explicitly set by the user or has a default value.

isSet(param)

Checks whether a param is explicitly set by the user.

classmethod load(path)

Reads an ML instance from the input path; a shortcut for read().load(path).

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Saves this ML instance to the given path; a shortcut for write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setAggregateCol(value)[source]

Sets the value of aggregateCol.

setAggregateType(value)[source]

Sets the value of aggregateType.

setInputCol(value)

Sets the value of inputCol.

setOutputCol(value)

Sets the value of outputCol.

setParams(self, inputCol=None, outputCol=None, aggregateType="count", aggregateCol=None)[source]

Sets params for this GroupByFeaturizer.
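
With the default aggregateType of "count", one plausible reading is that each row receives the number of rows sharing its inputCol value. The sketch below illustrates that reading only. Whether the real featurizer attaches the aggregate to every row or returns one row per group is not stated here, so that behavior is an assumption, as are the helper name group_by_count and the sample data.

```python
from collections import Counter


def group_by_count(rows, input_col, output_col):
    """Attach to each row the number of rows sharing its input_col value."""
    counts = Counter(row[input_col] for row in rows)
    return [{**row, output_col: counts[row[input_col]]} for row in rows]


# Hypothetical data and column names.
rows = [{"city": "Oslo"}, {"city": "Lima"}, {"city": "Oslo"}]
featurized = group_by_count(rows, "city", "city_count")
print([row["city_count"] for row in featurized])  # [2, 1, 2]
```
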

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters:
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame
  • params – an optional param map that overrides embedded params.
Returns: transformed dataset

New in version 1.3.0.

write()

Returns an MLWriter instance for this ML instance.