python, machine-learning, scikit-learn, feature-selection, chi-squared

How to use sklearn (chi-square or ANOVA) to remove redundant features


In the feature selection step, we want to identify relevant features and remove redundant ones.

From my understanding, redundant features are dependent features (so we want to keep only features that are independent of one another).

My question is about removing redundant features using sklearn and ANOVA / Chi-square tests.

From what I have read (and from the examples I have seen), we use SelectKBest or SelectPercentile to keep the best features, i.e. the ones most dependent on the target (y).

But can we use those methods with chi2 or f_classif in order to remove features that are dependent on each other?

In other words, I want to remove redundant features with sklearn methods. How can I do it?


Solution

  • You can use SelectKBest to score the features with a provided scoring function (e.g. chi-square) and keep the N highest-scoring ones. For example, to keep the top 10 features you can use the following:

    from sklearn.feature_selection import SelectKBest, chi2, f_classif
    
    # chi-square test (note: chi2 requires non-negative feature values)
    top_10_features = SelectKBest(chi2, k=10).fit_transform(X, y)
    
    # or ANOVA F-test
    top_10_features = SelectKBest(f_classif, k=10).fit_transform(X, y)
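
    fit_transform returns only the transformed array. If you also want to see which columns were kept, you can fit the selector and inspect get_support(). A minimal sketch, assuming X is a pandas DataFrame:

    from sklearn.feature_selection import SelectKBest, f_classif

    selector = SelectKBest(f_classif, k=10).fit(X, y)

    # Boolean mask marking the selected columns
    mask = selector.get_support()

    # Column names of the selected features (works when X is a DataFrame)
    selected_features = X.columns[mask]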
    

    However, there are many methods and techniques that are useful for feature reduction, and you typically need to decide which ones to use based on your data, the model you are training, and the output you want to predict. For example, even if you end up with 20 features, you should still check the correlation between each pair of features and drop one of the two whenever they are highly correlated.

    The following function returns the most highly correlated feature pairs. You can use its output to further reduce your current variable list:

    def get_feature_correlation(df, top_n=None, corr_method='spearman',
                                remove_duplicates=True, remove_self_correlations=True):
        """
        Compute the feature correlation and sort feature pairs based on their correlation
    
        :param df: The dataframe with the predictor variables
        :type df: pandas.core.frame.DataFrame
        :param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
        :param corr_method: Correlation computation method
        :type corr_method: str
        :param remove_duplicates: Indicates whether duplicate features must be removed
        :type remove_duplicates: bool
        :param remove_self_correlations: Indicates whether self correlations will be removed
        :type remove_self_correlations: bool
    
        :return: pandas.core.frame.DataFrame
        """
        corr_matrix_abs = df.corr(method=corr_method).abs()
        corr_matrix_abs_us = corr_matrix_abs.unstack()
        sorted_correlated_features = corr_matrix_abs_us \
            .sort_values(kind="quicksort", ascending=False) \
            .reset_index()
    
        # Remove comparisons of the same feature
        if remove_self_correlations:
            sorted_correlated_features = sorted_correlated_features[
                (sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
            ]
    
        # Remove duplicate pairs: each pair appears twice as (A, B) and (B, A)
        # on consecutive rows after sorting, so keep every other row
        if remove_duplicates:
            sorted_correlated_features = sorted_correlated_features.iloc[::2]
    
        # Create meaningful names for the columns
        sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)'] 
    
        if top_n:
            return sorted_correlated_features[:top_n]
    
        return sorted_correlated_features
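
    For example, a minimal usage sketch that drops one feature from each highly correlated pair (the 0.9 threshold and the DataFrame name X are illustrative assumptions):

    corr_pairs = get_feature_correlation(X)

    # Keep only the pairs above an (arbitrary) correlation threshold
    highly_correlated = corr_pairs[corr_pairs['Correlation (abs)'] > 0.9]

    # Drop the second feature of each highly correlated pair
    features_to_drop = highly_correlated['Feature 2'].unique().tolist()
    X_reduced = X.drop(columns=features_to_drop)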
    

    Other options could be:

    • Percentage of missing values (see the sketch after this list)
    • Correlation with the target variable
    • Include some random variables and check whether they make it into the subsequent reduced variable lists
    • Feature stability over time
    • etc.
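
    As an illustration of the first option, a quick sketch for ranking features by their percentage of missing values (df is assumed to be your pandas DataFrame of predictors, and the 0.5 cut-off is an illustrative assumption):

    # Fraction of missing values per feature, highest first
    missing_pct = df.isna().mean().sort_values(ascending=False)

    # Candidate features to drop
    high_missing = missing_pct[missing_pct > 0.5].index.tolist()
    print(high_missing)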

    As I mentioned, it actually depends on what you are trying to achieve.