find_sparsity

paralytics.utils.find_sparsity(X, thresh=0.01)

Finds columns with highly sparse categories.

For categorical and binary features, finds columns that contain at least one category whose relative frequency is below the threshold.

For numerical features (excluding binary variables), returns columns where NaNs or zeros dominate in the given dataset.
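
The two rules above can be sketched with plain pandas. This is an illustration only, not the library's implementation: the helper names `sparse_categories` and `dominated_by_nan_or_zero` are hypothetical, and "dominating" is assumed here to mean that NaN or 0 is the single most frequent value in the column.

```python
import numpy as np
import pandas as pd


def sparse_categories(col: pd.Series, thresh: float = 0.01) -> list:
    """Hypothetical helper: categories whose relative frequency is below `thresh`."""
    # Relative frequencies are computed over non-missing values only.
    freqs = col.value_counts(normalize=True)
    return freqs[freqs < thresh].index.tolist()


def dominated_by_nan_or_zero(col: pd.Series) -> bool:
    """Hypothetical helper: True if NaN or 0 is the most frequent value.

    Assumption: this is one possible reading of "dominating".
    """
    # dropna=False counts NaN as a value; results are sorted by count, descending.
    counts = col.value_counts(dropna=False)
    top = counts.index[0]
    return bool(pd.isna(top) or top == 0)


# Toy data: 200 rows, one categorical and one numerical column.
X = pd.DataFrame({
    "color": ["red"] * 150 + ["blue"] * 49 + ["teal"],
    "amount": [0.0] * 120 + [np.nan] * 30 + [float(i) for i in range(1, 51)],
})

print(sparse_categories(X["color"]))          # ['teal'] (1/200 = 0.005 < 0.01)
print(dominated_by_nan_or_zero(X["amount"]))  # True (0.0 occurs 120 times)
```

Under these assumptions, `color` would be reported among the categorical columns and `amount` among the numerical ones.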

Parameters

X: pandas.DataFrame: Data to be checked for sparsity.
thresh: float, optional (default=0.01): Relative frequency of a category below which the column is reported as sparse.

Returns

sparse_{num, bin, cat}: list: Names of the {numerical, binary, categorical} columns of X in which high sparsity was detected.