Tags: python, scikit-learn, pipeline, feature-selection

How to get the names of the selected features when there are several feature selection steps in a sklearn pipeline?


I want to use several feature selection methods in a sklearn pipeline as below:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


model = Pipeline([('vt', VarianceThreshold(0.01)),
                  ('kbest', SelectKBest(chi2, k=5)),
                  ('gbc', GradientBoostingClassifier(random_state=0))])


model.fit(X_train, y_train)
y_pred = model.predict(X_test)

I want to get the names or column indices of the selected features. The problem is that the 2nd feature selection step receives the output of the 1st feature selection step (not the original X_train). Therefore, when I call get_support() or get_feature_names_out() on the 2nd feature selection step, the feature names or indices don't match the original input features.

vt = model['vt']
vt.get_feature_names_out()
vt.get_support()


kbest = model['kbest']
kbest.get_feature_names_out()
kbest.get_support()

For example, when I run vt.get_support(), I get a boolean array with 30 entries. But when I run kbest.get_support(), I get a boolean array with 14 entries. This means that the column indices of the data passed to the 2nd feature selection step were reset, so they no longer match the columns of the data passed to the 1st feature selection step.

How can I solve this issue?


Solution

  • If you only need the names of the features that survive the whole pipeline, without caring which step selected which feature**, the following is an easy way to go.

    You can load your input X as a dataframe by setting the parameter as_frame to True (X, y = load_breast_cancer(return_X_y=True, as_frame=True)). The columns then have string names, which allows .get_feature_names_out() to return the selected features under their original names. This does not happen with a numpy array, because it has no explicit column names, so generic names such as x0, x1, ... are generated instead.

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    
    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    
    model = Pipeline([('vt', VarianceThreshold(0.01)),
                      ('kbest', SelectKBest(chi2, k=5)),
                      ('gbc', GradientBoostingClassifier(random_state=0))])
    
    model.fit(X_train, y_train)
    
    model[:-1].get_feature_names_out()
    


    ** Incidentally, this also gives you the original names of the features selected by the first transformer, but not by the second one when you query it on its own, because the dataframe is converted to a numpy array between the two steps.

    vt = model['vt']
    vt.get_feature_names_out()
    

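If you work with plain numpy arrays and need the column indices relative to the original X, one option is to chain the boolean masks returned by get_support() step by step. A minimal sketch, assuming every step in the pipeline is a selector exposing get_support(); the names selectors and original_idx are only illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# Illustrative pipeline with only the two selection steps (no classifier).
selectors = Pipeline([('vt', VarianceThreshold(0.01)),
                      ('kbest', SelectKBest(chi2, k=5))])
selectors.fit(X, y)

# Start with all original column indices and filter them with each mask in turn.
# Each mask refers to the columns that were left after the previous step.
original_idx = np.arange(X.shape[1])
for name, step in selectors.steps:
    original_idx = original_idx[step.get_support()]

print(original_idx)
```

Since selectors keep the column order, X[:, original_idx] should contain the same columns as selectors.transform(X).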