Tags: python, pandas, scikit-learn, feature-extraction, feature-selection

The easiest way to get feature names after running SelectKBest in scikit-learn


I'm trying to run a supervised machine-learning experiment using scikit-learn's SelectKBest, but I'm not sure how to create a new dataframe after finding the best features.

Let's assume I want to run the experiment with the 5 best features:

from sklearn.feature_selection import SelectKBest, f_classif

select_k_best_classifier = SelectKBest(score_func=f_classif, k=5)
fit_transofrmed_features = select_k_best_classifier.fit_transform(features_dataframe, targeted_class)

Now, if I add the lines:

import pandas as pd

dataframe = pd.DataFrame(fit_transofrmed_features)

I get a new dataframe without feature names (the columns are only labeled 0 to 4), but I want a dataframe whose columns carry the names of the selected features, something like this:

dataframe = pd.DataFrame(fit_transofrmed_features, columns=features_names)

My question is: how do I create the features_names list?

I know that I should use:

 select_k_best_classifier.get_support()

This returns an array of boolean values, where each True marks a column of the original dataframe that was selected.

How should I combine this boolean array with the list of all feature names, which I can get via feature_names = list(features_dataframe.columns.values)?


Solution

  • You can do the following:

    mask = select_k_best_classifier.get_support()  # boolean array, one entry per original column
    new_features = []  # names of the K best features
    
    for bool_val, feature in zip(mask, feature_names):
        if bool_val:
            new_features.append(feature)
    

    Then pass these names as the column labels of the new dataframe:

    dataframe = pd.DataFrame(fit_transofrmed_features, columns=new_features)
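
For reference, here is a minimal end-to-end sketch of the same idea, assuming pandas and scikit-learn are installed. The iris dataset, k=2, and the names selector and selected_values are stand-ins chosen for illustration and are not part of the original question. It keeps the fitted selector separate from the transformed array, and it indexes the column labels with the boolean mask instead of looping:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Stand-in data (an assumption for illustration): iris as a named dataframe.
iris = load_iris()
features_dataframe = pd.DataFrame(iris.data, columns=iris.feature_names)
targeted_class = iris.target

# Keep the fitted selector itself: get_support() is a method of the
# selector, not of the array returned by fit_transform().
selector = SelectKBest(score_func=f_classif, k=2)
selected_values = selector.fit_transform(features_dataframe, targeted_class)

# Boolean mask with one entry per original column; True marks a kept column.
mask = selector.get_support()

# Indexing the column labels with the mask yields the selected names,
# in the same order as the columns of selected_values.
new_features = features_dataframe.columns[mask]

dataframe = pd.DataFrame(
    selected_values,
    columns=new_features,
    index=features_dataframe.index,  # preserve the original row labels
)
print(dataframe.head())
```

Passing the original index is optional, but it keeps the new dataframe aligned with features_dataframe if the rows are not simply numbered from 0. Since scikit-learn 1.0, selector.get_feature_names_out() should return the same names directly when the selector was fitted on a dataframe.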