VIFSelector

class paralytics.VIFSelector(thresh=5.0, impute=False, impute_method='mean', fit_intercept=True, verbose=0)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Performs feature selection based on the Variance Inflation Factor (VIF).

Calculates the VIF of every variable in a given dataset, discards the variable with the highest VIF, and repeats this process until the highest remaining VIF no longer exceeds the declared threshold.
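
For orientation, the VIF of a variable is commonly defined as 1 / (1 - R^2), where R^2 comes from regressing that variable on all the others. The sketch below computes it with NumPy under that textbook definition, always including a constant term. The helper name vif_values is hypothetical and is not part of paralytics; the library's own computation may differ.

```python
import numpy as np


def vif_values(X):
    """Return the VIF of each column of a 2-D array (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        # Regress column j on all other columns plus a constant term.
        others = np.column_stack([np.ones(n_samples), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        # VIF = 1 / (1 - R^2) with R^2 = 1 - ss_res / ss_tot, i.e. ss_tot / ss_res.
        vifs[j] = np.inf if ss_res == 0.0 else ss_tot / ss_res
    return vifs


rng = np.random.default_rng(0)
a, b, d = rng.normal(size=(3, 200))
c = a + b + 0.01 * rng.normal(size=200)  # nearly a linear combination of a and b
print(vif_values(np.column_stack([a, b, c, d])))  # a, b, c get large VIFs; d stays near 1
```

With the default thresh of 5.0, a variable would be a rejection candidate once the others explain more than 80% of its variance (R^2 > 0.8).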

Parameters
thresh: float, optional (default=5.0)

VIF threshold. Variables are rejected until the highest remaining VIF no longer exceeds this value.

impute: bool, optional (default=False)

Whether missing values should be imputed.

impute_method: string, optional (default='mean')

Numerical imputation method used by paralytics.preprocessing.Imputer.

fit_intercept: bool, optional (default=True)

Specifies whether a constant (a.k.a. bias or intercept) should be added to the decision function.

verbose: int, optional (default=0)

Controls verbosity of output. If 0 there is no output, if 1 displays

References

[1] Ffisegydd, sklearn multicollinearity class, 2017

Attributes
imputer_: estimator

The estimator by means of which missing values imputation is performed.

viffed_cols_: list

List of features from the given dataset whose VIF exceeded thresh.

kept_cols_: list

List of features that remain after the VIF procedure.

Methods Summary

fit(self, X[, y])

Identifies the columns whose VIF value exceeds the threshold.

transform(self, X)

Applies feature selection based on the Variance Inflation Factor.

Methods Documentation

fit(self, X, y=None)

Identifies the columns whose VIF value exceeds the threshold.

If impute is True, also fits the imputer on X.

Parameters
X: DataFrame, shape = (n_samples, n_features)

Input data, where n_samples is the number of samples and n_features is the number of features.

Returns
self: object

Returns the instance itself.
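
To illustrate what fitting an imputer with the 'mean' method amounts to, here is a minimal NumPy sketch: the fill values are learned once from the fitted data and can then be reused on any later input. The function names are hypothetical and this is not the paralytics.preprocessing.Imputer implementation, whose interface is not documented on this page.

```python
import numpy as np


def fit_column_means(X):
    """Learn one fill value per column: the mean of its non-missing entries."""
    return np.nanmean(np.asarray(X, dtype=float), axis=0)


def fill_missing(X, fill_values):
    """Return a copy of X with every NaN replaced by its column's fill value."""
    X = np.array(X, dtype=float)  # np.array copies, so the input is not modified
    rows, cols = np.nonzero(np.isnan(X))
    X[rows, cols] = fill_values[cols]
    return X


X_train = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 6.0]])
means = fit_column_means(X_train)        # learned once, on the data passed to fit
X_filled = fill_missing(X_train, means)  # NaNs replaced column by column
```

Imputing first matters because the regressions behind the VIF cannot be computed on rows that contain missing values.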

transform(self, X)

Applies feature selection based on the Variance Inflation Factor.

In every iteration, the VIF of each independent variable is calculated and the variable with the highest VIF is removed. Iteration stops once the maximum VIF in the dataset no longer exceeds the declared threshold.

Parameters
X: DataFrame, shape = (n_samples, n_features)

Input data on which variable elimination will be applied.

Returns
X_new: DataFrame, shape = (n_samples, n_features_new)

Input data restricted to the variables that remain after feature elimination.
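
The elimination loop described for transform can be sketched as follows. This is an illustration under stated assumptions and not the library's code: it works on a NumPy array plus a list of column names, it takes the VIFs from the diagonal of the inverse correlation matrix (equivalent to 1 / (1 - R^2) for regressions with an intercept), and it assumes no column is constant or an exact linear combination of the others. The function name vif_elimination is hypothetical.

```python
import numpy as np


def vif_elimination(X, names, thresh=5.0):
    """Drop the highest-VIF column until no VIF exceeds thresh.

    Returns (kept, dropped), two lists of column names that play the role of
    kept_cols_ and viffed_cols_ in this sketch.
    """
    X = np.asarray(X, dtype=float)
    kept = list(names)
    dropped = []
    while len(kept) > 1:
        corr = np.corrcoef(X, rowvar=False)
        vifs = np.diag(np.linalg.inv(corr))  # assumes corr is invertible
        worst = int(np.argmax(vifs))
        if vifs[worst] <= thresh:
            break  # every remaining variable is at or below the threshold
        dropped.append(kept.pop(worst))
        X = np.delete(X, worst, axis=1)
    return kept, dropped


rng = np.random.default_rng(0)
a, b, d = rng.normal(size=(3, 200))
c = a + b + 0.01 * rng.normal(size=200)  # nearly collinear with a and b
kept, dropped = vif_elimination(np.column_stack([a, b, c, d]), ["a", "b", "c", "d"])
```

With a DataFrame, the same result would be obtained by selecting the surviving columns, for example X[kept].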