CategoricalGrouper

class paralytics.preprocessing.CategoricalGrouper(method='freq', percentile_thresh=0.05, new_cat='Other', include_cols=None, exclude_cols=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Groups sparse observations in categorical columns into one category.

Parameters
method: string {'freq'}, optional (default='freq')

The sparse categories grouping method:

  • freq:

    Counts the frequency of each category. Retains the categories whose cumulative share of the total dataset (with categories sorted in descending order of frequency) is equal to or higher than the percentile threshold.

percentile_thresh: float, optional (default=0.05)

Defines the percentile threshold for the 'freq' method.
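The description of the 'freq' rule leaves some room for interpretation, so the following is only a minimal sketch of one possible reading, written in plain Python rather than taken from the paralytics source. It assumes that the cumulative share is accumulated from the rarest category upward and that a category is grouped while that running share stays below percentile_thresh. The function name sparse_categories and the sample data are hypothetical.

```python
from collections import Counter


def sparse_categories(values, percentile_thresh=0.05):
    """Return the categories that would be grouped under the assumed reading."""
    counts = Counter(values)
    total = sum(counts.values())
    sparse = set()
    running = 0
    # Walk from the rarest category to the most frequent one
    # (ties are broken by category name to keep the result deterministic).
    for cat, n in sorted(counts.items(), key=lambda kv: (kv[1], str(kv[0]))):
        running += n
        if running / total >= percentile_thresh:
            break  # this category and every more frequent one is retained
        sparse.add(cat)
    return sparse


# Hypothetical column with 100 observations.
colors = ["red"] * 60 + ["blue"] * 36 + ["teal"] * 2 + ["pink"] + ["gold"]

sparse = sparse_categories(colors, percentile_thresh=0.05)
# Replace every sparse category with the new category name.
grouped = ["Other" if c in sparse else c for c in colors]
```

With these numbers the three rarest categories together cover 4% of the rows, which is below the 5% threshold, so they are the ones merged into 'Other'. A different reading of the rule (for example, comparing each category's own share with the threshold) would give different results on other data.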

new_cat: string or int, optional (default='Other')

Specifies the category name that is assigned to the observations identified as sparse.

include_cols: list, optional (default=None)

Specifies column names that should be treated as categorical features. If None, the estimator runs only on the automatically selected categorical columns.

exclude_cols: list, optional (default=None)

Specifies names of categorical columns that should not be treated as categorical features. If None, no column is excluded from the transformation.
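How include_cols and exclude_cols combine with the automatic selection is not spelled out here. The sketch below shows one plausible combination, under two stated assumptions: a column is detected automatically when all of its values are strings, and the final set is the automatic selection plus include_cols, minus exclude_cols. The helper select_categorical and the sample data are hypothetical and are not part of paralytics.

```python
def select_categorical(columns, include_cols=None, exclude_cols=None):
    """columns: dict mapping a column name to its list of values."""
    # Assumed auto-detection: every value in the column is a string.
    selected = [
        name for name, vals in columns.items()
        if all(isinstance(v, str) for v in vals)
    ]
    # Columns forced in by the caller, e.g. integer-coded categories.
    for name in include_cols or []:
        if name not in selected:
            selected.append(name)
    # Columns the caller wants left untouched.
    excluded = set(exclude_cols or [])
    return [name for name in selected if name not in excluded]


data = {
    "city": ["Oslo", "Rome", "Oslo"],
    "zip_code": [1001, 2002, 1001],
    "comment": ["ok", "late", "ok"],
    "price": [9.5, 7.0, 8.25],
}

cat_cols = select_categorical(
    data, include_cols=["zip_code"], exclude_cols=["comment"]
)
```

Here zip_code is numeric but is treated as categorical because it is listed explicitly, while comment is detected automatically and then dropped by exclude_cols.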

Attributes
cat_cols_: list

List of categorical columns in a given dataset.

imp_cats_: dict

Dictionary that records, for every feature in the dataset, the category names that were replaced with the new category.

Methods Summary

fit(self, X[, y])

Fits the grouping on X using the given method.

transform(self, X)

Applies the grouping of sparse categories to X.

Methods Documentation

fit(self, X, y=None)[source]

Fits the grouping on X using the given method.

Parameters
X: pd.DataFrame, shape = (n_samples, n_features)

Training data of independent variable values.

y: ignored

Returns
self: object

Returns the instance itself.

transform(self, X)[source]

Applies the grouping of sparse categories to X.

Parameters
X: pd.DataFrame, shape = (n_samples, n_features)

Data in which the sparse categories are to be grouped.

Returns
X_new: pd.DataFrame, shape = (n_samples_new, n_features)

X with the sparse categories replaced by new_cat.
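The fit/transform split means the sparse categories are determined once from the training data and then reused on any later data. The sketch below illustrates that contract with a hypothetical stand-in class, TinyGrouper, which works on plain dictionaries of lists instead of a pd.DataFrame. It is not the paralytics implementation: the grouping rule (rarest categories whose running share stays below the threshold) and the layout of imp_cats_ (column name mapped to a list of replaced categories) are both assumptions.

```python
from collections import Counter


class TinyGrouper:
    """Hypothetical stand-in that mimics the fit/transform contract."""

    def __init__(self, percentile_thresh=0.05, new_cat="Other"):
        self.percentile_thresh = percentile_thresh
        self.new_cat = new_cat

    def fit(self, X, y=None):
        # X: dict mapping a column name to a list of category labels.
        self.imp_cats_ = {}
        for col, values in X.items():
            total, running, sparse = len(values), 0, []
            by_rarity = sorted(Counter(values).items(), key=lambda kv: (kv[1], kv[0]))
            for cat, n in by_rarity:
                running += n
                if running / total >= self.percentile_thresh:
                    break  # remaining categories are frequent enough to keep
                sparse.append(cat)
            self.imp_cats_[col] = sparse
        return self  # returning self allows chaining, as in fit(...).transform(...)

    def transform(self, X):
        # Reuse the categories learned in fit; nothing is recomputed here.
        return {
            col: [self.new_cat if v in self.imp_cats_.get(col, []) else v
                  for v in values]
            for col, values in X.items()
        }


train = {"color": ["red"] * 60 + ["blue"] * 36 + ["teal"] * 2 + ["pink"] + ["gold"]}
grouper = TinyGrouper(percentile_thresh=0.05).fit(train)

new_data = {"color": ["red", "teal", "gold", "blue"]}
out = grouper.transform(new_data)
```

In new_data, teal and gold each make up a quarter of the rows, yet they are still replaced because the decision was made on the training data during fit.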