I am using a Pipeline with GridSearchCV, and I am trying to retrieve the features selected by SelectKBest.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

pipeline = Pipeline([
    ('transform', SimpleImputer(strategy='mean')),
    ('selector', SelectKBest(f_regression)),
    ('classifier', KNeighborsClassifier()),
])
scoring = ['precision', 'recall', 'accuracy']
CV = StratifiedKFold(n_splits=4, random_state=None, shuffle=True)
search = GridSearchCV(
    estimator=pipeline,
    param_grid={
        'selector__k': [3, 4, 5, 6, 7, 8, 9, 10],
        'classifier__n_neighbors': [3, 4, 5, 6, 7],
        'classifier__weights': ['uniform', 'distance'],
        'classifier__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
        'classifier__p': [1, 2],
    },
    n_jobs=-1,
    refit='accuracy',
    scoring=scoring,
    cv=CV,
    verbose=0,
)
search.fit(data, target)
SelectKBest acts on the training data of each split instead of the whole dataset, which is what I want. The confusing part for me is this line, which returns a set of features:
search.best_estimator_.named_steps['selector'].get_support()
What features am I getting here? I assume that for each iteration there is a different set of selected features based on the split.
Since the search parameter refit is not False, the best-performing set of hyperparameters has been used to refit a model on the entire training set (no splitting into folds for this part); that single model is what is exposed in the attribute best_estimator_.
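To make that concrete, here is a minimal sketch on synthetic data with a reduced grid (the data and grid are placeholders, not your setup): get_support() on best_estimator_ describes only the single refit model, and the number of True entries matches the winning selector__k.

```python
# Minimal sketch (synthetic data, reduced grid) showing that get_support()
# describes the single refit model, not the per-fold selections.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ('transform', SimpleImputer(strategy='mean')),
    ('selector', SelectKBest(f_regression)),
    ('classifier', KNeighborsClassifier()),
])
search = GridSearchCV(pipe, {'selector__k': [3, 5]}, cv=4).fit(X, y)

# Boolean mask over the input columns, for the refit model only
mask = search.best_estimator_.named_steps['selector'].get_support()
print('best k:', search.best_params_['selector__k'])
print('selected columns:', [i for i, keep in enumerate(mask) if keep])
```

The mask always has one entry per input column, so you can map it back to your column names with a list comprehension as above.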
You could define an additional scoring callable that returns the selected feature list; then cv_results_ would contain the features selected for each hyperparameter combination and each fold.
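One caveat: scikit-learn scorers must return a single number, so the feature list cannot be returned directly. A workaround (my own sketch, not an official API, and it assumes few enough features that an integer bitmask fits) is to encode the selector's support mask as an integer; the hypothetical scorer name selected_features_score is mine:

```python
# Sketch: a custom "scorer" that encodes which features SelectKBest kept
# on each fold as an integer bitmask (bit i set iff column i was selected),
# so cv_results_ records it per hyperparameter combination and per fold.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

def selected_features_score(estimator, X, y):
    # Scorers must return one number, so pack the boolean mask into an int.
    mask = estimator.named_steps['selector'].get_support()
    return float(sum(1 << i for i, keep in enumerate(mask) if keep))

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
pipe = Pipeline([('selector', SelectKBest(f_regression)),
                 ('classifier', KNeighborsClassifier())])
search = GridSearchCV(
    pipe,
    {'selector__k': [3, 5]},
    scoring={'accuracy': 'accuracy', 'features': selected_features_score},
    refit='accuracy',
    cv=4,
).fit(X, y)

# cv_results_ now has split0_test_features, split1_test_features, ...
# Decode the bitmask for fold 0 of the first parameter combination:
code = int(search.cv_results_['split0_test_features'][0])
print('fold 0, first combo selected:',
      [i for i in range(8) if code & (1 << i)])
```

Note that mean_test_features in cv_results_ averages the bitmasks and is meaningless; only the per-split entries should be decoded.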