TargetEncoder

class paralytics.TargetEncoder(columns=None, nan_as_category=True, cv=None, inner_cv=None, shuffle=True, alpha=5, random_state=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Encodes categorical features with the corresponding target value.

If the cv parameter is specified, the mean values are determined by cross-validation with a nested inner cross-validation. As a result, the aggregated target values obtained for each outer fold are less biased.

Parameters

columns: list, optional (default=None): Names of the DataFrame columns to target encode. If not specified, all categorical columns are used.
nan_as_category: boolean, optional (default=True): If True, treats NaN as a separate category and target encodes this subgroup as well.
cv: int, optional (default=None): Number of cross-validation folds.
inner_cv: int, optional (default=None): Number of inner cross-validation folds.
shuffle: boolean, optional (default=True): Whether to shuffle the data before splitting into batches.
alpha: int, optional (default=5): Regularization value: the number of times the global mean is added to the weighted mean of each category. The larger the value, the more conservative the encoding. Set alpha to 0 to use the standard mean.
random_state: int, optional (default=None): Random state for sklearn algorithms.
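
The effect of alpha can be illustrated with a minimal sketch. It assumes the common additive-smoothing form, (sum of targets in the category + alpha * global mean) / (category count + alpha). The exact formula used by TargetEncoder is not stated on this page, and the function name `smoothed_means` is hypothetical.

```python
from collections import defaultdict


def smoothed_means(categories, targets, alpha=5):
    """Return {category: smoothed target mean} (assumed additive smoothing)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # alpha acts like `alpha` pseudo-observations equal to the global mean.
    return {
        cat: (sums[cat] + alpha * global_mean) / (counts[cat] + alpha)
        for cat in sums
    }


cats = ["a", "a", "a", "b"]
ys = [1, 1, 0, 0]
plain = smoothed_means(cats, ys, alpha=0)   # ordinary per-category means
shrunk = smoothed_means(cats, ys, alpha=5)  # pulled toward the global mean
```

Under this assumption, the single-observation category "b" moves much closer to the global mean than the better-represented category "a", which is the conservative behaviour the alpha description refers to.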

See also

paralytics.preprocessing.CategoricalGrouper

Notes

When setting the cross-validation parameters, remember that every category must be sufficiently represented. If a sparse category is missing from the data used to compute the statistics for one of the k folds, NaNs are generated in that fold because there are no recorded values from which to calculate them. A simple solution is to apply the preprocessing.CategoricalGrouper transformer, which groups sparse categories into a single category, before target encoding.
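
The sparse-category problem can be reproduced with a short, library-independent sketch. It uses a single level of out-of-fold encoding with deterministic folds (no shuffling, no inner cross-validation), so it only illustrates the mechanism and is not the nested scheme TargetEncoder implements; the function name `out_of_fold_encode` is hypothetical.

```python
import math
from collections import defaultdict


def out_of_fold_encode(categories, targets, n_folds=2):
    """Encode each row with category means computed on the other folds."""
    n = len(categories)
    encoded = [math.nan] * n
    for fold in range(n_folds):
        # Rows whose index is congruent to `fold` form the held-out fold.
        held_out = [i for i in range(n) if i % n_folds == fold]
        held_out_set = set(held_out)
        sums, counts = defaultdict(float), defaultdict(int)
        for i in range(n):
            if i not in held_out_set:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        for i in held_out:
            cat = categories[i]
            if counts[cat]:  # category was seen in the remaining folds
                encoded[i] = sums[cat] / counts[cat]
    return encoded


cats = ["a", "a", "a", "a", "rare"]
ys = [1, 0, 1, 1, 1]
encoded = out_of_fold_encode(cats, ys, n_folds=2)
```

The only "rare" row falls into a held-out fold, so no statistics exist for it in the remaining fold and its encoded value stays NaN, while every "a" row receives a mean.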

Attributes

cat_aggval_: dict: Nested dictionary of the aggregated target values fitted for each subgroup. Each key is a column name, and each value is a dictionary that maps a subgroup name to its fitted aggregated target value.
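
The shape of this attribute can be sketched as follows. The column names, subgroup names, and values are made up for illustration, the helper `encode_rows` is hypothetical, and mapping unseen categories to NaN is an assumption rather than documented behaviour.

```python
# Hypothetical fitted mapping in the documented shape:
# {column name: {subgroup name: aggregated target value}}
cat_aggval = {
    "city": {"London": 0.62, "Paris": 0.41},
    "plan": {"basic": 0.30, "premium": 0.75},
}


def encode_rows(rows, mapping):
    """Replace each mapped column's category with its aggregated value."""
    return [
        {
            # Columns absent from the mapping pass through unchanged;
            # unseen categories become NaN (an assumption of this sketch).
            col: mapping[col].get(val, float("nan")) if col in mapping else val
            for col, val in row.items()
        }
        for row in rows
    ]


rows = [{"city": "Paris", "plan": "premium", "age": 31}]
encoded = encode_rows(rows, cat_aggval)
```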

Methods Summary

`fit`(self, X, y)	Fits corresponding target aggregated values to categorical subgroups.
`fit_transform`(self, X[, y])	Fit to data then transform it.
`transform`(self, X[, y])	Applies target encoding on X.

Methods Documentation

fit(self, X, y)

Fits corresponding target aggregated values to categorical subgroups.

Parameters

X: DataFrame, shape=(n_samples, n_features): Training data of independent categorical variables.
y: array-like, shape=(n_samples, ): Vector of target variable values corresponding to X data.

Returns

self: object: Returns the instance itself.

fit_transform(self, X, y=None)

Fit to data then transform it.

Fits transformer to X and y and returns transformed version of X.

Parameters

X: DataFrame, shape = (n_samples, n_features): Training data of independent categorical variables.
y: array-like, shape = (n_samples, ): Vector of target variable values corresponding to X data.

Returns

X_new: DataFrame, shape = (n_samples, n_features): X with the category values replaced by their respective aggregated target values.

transform(self, X, y=None)

Applies target encoding on X.

X is target encoded with the aggregated values stored in cat_aggval_. For the training data, the encoding additionally includes the spread obtained from the cross-validation within cross-validation.

Parameters

X: DataFrame, shape = (n_samples, n_features): New data with n_samples as its number of samples.
y: array-like, shape = (n_samples, ): Vector of target variable values corresponding to X data.

Returns

X_new: DataFrame, shape = (n_samples, n_features): X with the category values replaced by their respective aggregated target values.