python, machine-learning, scikit-learn, feature-selection, chi-squared

How to use sklearn (chi-square or ANOVA) to remove redundant features


In the feature selection step, we want to identify relevant features and remove redundant ones.

From my understanding, redundant features are dependent features (so we want to keep only features that are independent of one another).

My question is about removing redundant features using sklearn and ANOVA / Chi-square tests.

From what I have read (and from the examples I have seen), we use SelectKBest or SelectPercentile to keep the best features, i.e. the ones most dependent on the target (y).

But can we use those methods with chi2 or f_classif in order to remove features that are dependent on each other?

In other words, I want to remove redundant features with sklearn methods. How can I do it?


Solution

  • You can use SelectKBest to score the features with a provided scoring function (e.g. chi-square) and keep the N highest-scoring ones. For example, to keep the top 10 features you can use the following:

    from sklearn.feature_selection import SelectKBest, chi2, f_classif
    
    # chi-square test (note: chi2 requires non-negative feature values)
    top_10_features = SelectKBest(chi2, k=10).fit_transform(X, y)
    
    # or ANOVA F-test
    top_10_features = SelectKBest(f_classif, k=10).fit_transform(X, y)
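
    fit_transform returns only the transformed array. If you also want to see which columns were kept, you can fit the selector and inspect get_support(). A minimal sketch, assuming X is a pandas DataFrame:

    from sklearn.feature_selection import SelectKBest, f_classif

    selector = SelectKBest(f_classif, k=10).fit(X, y)

    # Boolean mask marking the selected columns
    mask = selector.get_support()

    # Column names of the selected features (works when X is a DataFrame)
    selected_features = X.columns[mask]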
    

    However, there are many methods and techniques that are useful for feature reduction, and you typically need to decide which ones to use based on your data, the model you are training, and the output you want to predict. For example, even if you end up with 20 features, you should still check the correlation between each pair of features and drop one of the two whenever they are highly correlated.

    The following function returns the most highly correlated feature pairs. You can use its output to further reduce your current variable list:

    def get_feature_correlation(df, top_n=None, corr_method='spearman',
                                remove_duplicates=True, remove_self_correlations=True):
        """
        Compute the feature correlation and sort feature pairs based on their correlation
    
        :param df: The dataframe with the predictor variables
        :type df: pandas.core.frame.DataFrame
        :param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
        :param corr_method: Correlation computation method
        :type corr_method: str
        :param remove_duplicates: Indicates whether duplicate features must be removed
        :type remove_duplicates: bool
        :param remove_self_correlations: Indicates whether self correlations will be removed
        :type remove_self_correlations: bool
    
        :return: pandas.core.frame.DataFrame
        """
        corr_matrix_abs = df.corr(method=corr_method).abs()
        corr_matrix_abs_us = corr_matrix_abs.unstack()
        sorted_correlated_features = corr_matrix_abs_us \
            .sort_values(kind="quicksort", ascending=False) \
            .reset_index()
    
        # Remove comparisons of the same feature
        if remove_self_correlations:
            sorted_correlated_features = sorted_correlated_features[
                (sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
            ]
    
        # Remove duplicate pairs: each pair appears twice as (A, B) and (B, A)
        # on consecutive rows after sorting, so keep every other row
        if remove_duplicates:
            sorted_correlated_features = sorted_correlated_features.iloc[::2]
    
        # Create meaningful names for the columns
        sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)'] 
    
        if top_n:
            return sorted_correlated_features[:top_n]
    
        return sorted_correlated_features
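
    For example, a minimal usage sketch that drops one feature from each highly correlated pair (the 0.9 threshold and the DataFrame name X are illustrative assumptions):

    corr_pairs = get_feature_correlation(X)

    # Keep only the pairs above an (arbitrary) correlation threshold
    highly_correlated = corr_pairs[corr_pairs['Correlation (abs)'] > 0.9]

    # Drop the second feature of each highly correlated pair
    features_to_drop = highly_correlated['Feature 2'].unique().tolist()
    X_reduced = X.drop(columns=features_to_drop)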
    

    Other options could be:

    • Percentage of missing values (see the sketch after this list)
    • Correlation with the target variable
    • Include some random variables and check whether they make it into the subsequent reduced variable lists
    • Feature stability over time
    • etc.
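
    As an illustration of the first option, a quick sketch for ranking features by their percentage of missing values (df is assumed to be your pandas DataFrame of predictors, and the 0.5 cut-off is an illustrative assumption):

    # Fraction of missing values per feature, highest first
    missing_pct = df.isna().mean().sort_values(ascending=False)

    # Candidate features to drop
    high_missing = missing_pct[missing_pct > 0.5].index.tolist()
    print(high_missing)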

    As I mentioned, it actually depends on what you are trying to achieve.