CategoricalGrouper

class paralytics.preprocessing.CategoricalGrouper(method='freq', percentile_thresh=0.05, new_cat='Other', include_cols=None, exclude_cols=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Groups sparse observations in categorical columns into one category.

Parameters

method: string {'freq'}, optional (default='freq')

The method used to group sparse categories:

freq:

Counts the frequency of each category. Retains categories whose cumulative share of the total dataset (accumulated over the categories sorted in descending order) is equal to or higher than the percentile threshold.

percentile_thresh: float, optional (default=0.05)

Defines the percentile threshold used by the 'freq' method.

new_cat: string or int, optional (default='Other')

Specifies the category name assigned to the observations identified as sparse.

include_cols: list, optional (default=None)

Specifies names of columns that should be treated as categorical features. If None, the estimator runs only on the automatically selected categorical columns.

exclude_cols: list, optional (default=None)

Specifies names of categorical columns that should not be treated as categorical features. If None, no column is excluded from the transformation.
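
The 'freq' rule can be illustrated with plain pandas. The sketch below is one possible reading of the description, not the library's implementation: it assumes that shares are accumulated from the rarest category upward and that a category is grouped while the running total stays below percentile_thresh. The helper names find_sparse_categories and group_sparse are hypothetical.

```python
import pandas as pd


def find_sparse_categories(col, percentile_thresh=0.05):
    """Hypothetical helper: list the categories treated as sparse.

    Assumption: shares are accumulated from the rarest category upward,
    and a category is sparse while the running total is below the threshold.
    """
    shares = col.value_counts(normalize=True).sort_values(ascending=True)
    cumulative = shares.cumsum()
    return cumulative[cumulative < percentile_thresh].index.tolist()


def group_sparse(col, sparse, new_cat="Other"):
    """Hypothetical helper: replace every sparse category with new_cat."""
    return col.where(~col.isin(sparse), new_cat)


# 100 observations: red 60%, blue 36%, teal 3%, pink 1%.
colors = pd.Series(["red"] * 60 + ["blue"] * 36 + ["teal"] * 3 + ["pink"])

sparse = find_sparse_categories(colors, percentile_thresh=0.05)
grouped = group_sparse(colors, sparse, new_cat="Other")

print(sparse)                  # ['pink', 'teal']
print(grouped.value_counts())  # red 60, blue 36, Other 4
```

Under this reading, pink and teal together account for 4% of the rows, which is below the 5% threshold, so both are replaced by new_cat, while blue and red are retained.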

Attributes

cat_cols_: list: List of categorical columns in a given dataset.
imp_cats_: dict: Dictionary that records, for every feature in the dataset, the category names replaced by the new category.
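
How cat_cols_ could follow from include_cols and exclude_cols is sketched below. This is an assumption-laden illustration rather than the library's logic: it assumes that "automatically selected" means columns of object or category dtype, that include_cols adds to that selection, and that exclude_cols removes from it. The function name resolve_cat_cols is hypothetical.

```python
import pandas as pd


def resolve_cat_cols(X, include_cols=None, exclude_cols=None):
    """Hypothetical helper: decide which columns are treated as categorical."""
    # Assumption: automatic selection picks object and category dtypes.
    auto = X.select_dtypes(include=["object", "category"]).columns.tolist()
    # Columns forced in by the caller, skipping ones already selected.
    extra = [c for c in (include_cols or []) if c not in auto]
    # Columns forced out by the caller.
    excluded = set(exclude_cols or [])
    return [c for c in auto + extra if c not in excluded]


X = pd.DataFrame({
    "city": pd.Series(["Oslo", "Rome", "Oslo"], dtype="object"),
    "segment": pd.Series(["a", "b", "a"], dtype="category"),
    "zip_code": [1001, 2002, 1001],
    "age": [31, 45, 27],
})

default_cols = resolve_cat_cols(X)
cat_cols = resolve_cat_cols(X, include_cols=["zip_code"], exclude_cols=["segment"])

print(default_cols)  # ['city', 'segment']
print(cat_cols)      # ['city', 'zip_code']
```

A numeric code such as zip_code is the typical case for include_cols: its dtype is integer, but its values behave like labels.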

Methods Summary

`fit`(self, X[, y])	Fits the grouping to X using the given method.
`transform`(self, X)	Applies the grouping of sparse categories to X.

Methods Documentation

fit(self, X, y=None)

Fits the grouping to X using the given method.

Parameters

X: pd.DataFrame, shape = (n_samples, n_features): Training data of independent variable values.
y: Ignored.

Returns

self: object: Returns the instance itself.

transform(self, X)

Applies the grouping of sparse categories to X.

Parameters

X: pd.DataFrame, shape = (n_samples, n_features): Data to transform, with n_samples rows.

Returns

X_new: pd.DataFrame, shape = (n_samples_new, n_features): X with sparse categories replaced by new_cat.
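
The division of labour between fit and transform can be sketched with a small stand-in class. TinyGrouper below is hypothetical and is not the paralytics class: it does not inherit from the scikit-learn bases, it treats every column as categorical, it stores a plain list of sparse categories per column in imp_cats_, and it assumes that shares are accumulated from the rarest category upward. It only shows that the sparse categories are learned once in fit and then applied unchanged to any data passed to transform.

```python
import pandas as pd


class TinyGrouper:
    """Hypothetical stand-in illustrating the fit/transform contract."""

    def __init__(self, percentile_thresh=0.05, new_cat="Other"):
        self.percentile_thresh = percentile_thresh
        self.new_cat = new_cat

    def fit(self, X, y=None):
        # y is accepted for pipeline compatibility and is ignored.
        self.imp_cats_ = {}
        for col in X.columns:  # assumption: every column is categorical
            shares = X[col].value_counts(normalize=True).sort_values()
            cumulative = shares.cumsum()
            sparse = cumulative[cumulative < self.percentile_thresh]
            self.imp_cats_[col] = sparse.index.tolist()
        return self

    def transform(self, X):
        # Work on a copy so the caller's frame is left untouched.
        X_new = X.copy()
        for col, sparse in self.imp_cats_.items():
            X_new[col] = X_new[col].where(~X_new[col].isin(sparse), self.new_cat)
        return X_new


train = pd.DataFrame({"color": ["red"] * 60 + ["blue"] * 36 + ["teal"] * 3 + ["pink"]})
new_data = pd.DataFrame({"color": ["red", "teal", "pink", "blue"]})

grouper = TinyGrouper(percentile_thresh=0.05).fit(train)
result = grouper.transform(new_data)

print(grouper.imp_cats_)          # {'color': ['pink', 'teal']}
print(result["color"].tolist())   # ['red', 'Other', 'Other', 'blue']
```

Because the mapping is fixed at fit time, teal and pink are replaced in new_data even though each makes up 25% of that small frame.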