machine-learning · scikit-learn · feature-selection · gridsearchcv

get_support of the features selected in GridSearchCV


I am using a Pipeline with GridSearchCV, and I am trying to retrieve the features selected by SelectKBest:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV

pipeline = Pipeline([
    ('transform', SimpleImputer(strategy='mean')),
    ('selector', SelectKBest(f_regression)),
    ('classifier', KNeighborsClassifier()),
])

scoring = ['precision', 'recall', 'accuracy']

CV = StratifiedKFold(n_splits=4, random_state=None, shuffle=True)

search = GridSearchCV(
    estimator=pipeline,
    param_grid={
        'selector__k': [3, 4, 5, 6, 7, 8, 9, 10],
        'classifier__n_neighbors': [3, 4, 5, 6, 7],
        'classifier__weights': ['uniform', 'distance'],
        'classifier__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
        'classifier__p': [1, 2],
    },
    n_jobs=-1,
    refit='accuracy',
    scoring=scoring,
    cv=CV,
    verbose=0,
)
search.fit(data, target)

SelectKBest acts on the training data of each split rather than on the whole dataset, which is perfect. The confusing part for me is this line, which returns a set of features:

search.best_estimator_.named_steps['selector'].get_support()  

What features am I getting here? I assume that for each fold there is a different set of selected features, based on the split.
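To make the question concrete, here is a small self-contained version of my setup (synthetic data and made-up column names), showing that get_support() returns a boolean mask that can be mapped back to column names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for my real data/target.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
data = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(10)])

pipeline = Pipeline([
    ('transform', SimpleImputer(strategy='mean')),
    ('selector', SelectKBest(f_regression)),
    ('classifier', KNeighborsClassifier()),
])
search = GridSearchCV(
    pipeline,
    param_grid={'selector__k': [3, 5]},
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=4),
)
search.fit(data, y)

# get_support() is a boolean mask over the input columns;
# map it back to column names to see which features survived.
mask = search.best_estimator_.named_steps['selector'].get_support()
print(data.columns[mask].tolist())
```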


Solution

  • Since the search parameter refit is not False, the best-performing set of hyperparameters has been used to refit a model onto the entire training set (no splitting into folds for this part); that single model is what is exposed in the attribute best_estimator_.

    You could define an additional scoring callable that records the selected features; then in cv_results_ you would have the features selected for each hyperparameter combination and each fold.
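One way to sketch this, given that scorers must return a single number: encode the fitted selector's boolean support mask as an integer bitmask, and read the per-fold values back out of cv_results_. The scorer name `features` and the helper below are my own, not a scikit-learn API:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

N_FEATURES = 8

def selected_features_scorer(estimator, X, y):
    # Scorers must return a scalar, so pack the boolean support mask
    # of the fitted selector into an integer bitmask.
    mask = estimator.named_steps['selector'].get_support()
    return int(sum(1 << i for i, keep in enumerate(mask) if keep))

X, y = make_classification(n_samples=200, n_features=N_FEATURES,
                           random_state=0)
pipeline = Pipeline([
    ('selector', SelectKBest(f_classif)),
    ('classifier', KNeighborsClassifier()),
])
search = GridSearchCV(
    pipeline,
    param_grid={'selector__k': [3, 5]},
    # Multi-metric scoring: one real metric plus the mask "scorer".
    scoring={'accuracy': 'accuracy', 'features': selected_features_scorer},
    refit='accuracy',
    cv=StratifiedKFold(n_splits=4),
)
search.fit(X, y)

# Decode the per-fold masks for each hyperparameter combination.
for fold in range(4):
    codes = search.cv_results_[f'split{fold}_test_features']
    masks = [[bool(int(c) >> i & 1) for i in range(N_FEATURES)]
             for c in codes]
    print(f"fold {fold}:", masks)
```

Because cv_results_ keeps one entry per split (split0_test_features, split1_test_features, …), this recovers a possibly different feature set per fold, confirming the intuition in the question.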