python scikit-learn pipeline cross-validation feature-selection

Grid searching the hyper-parameters of SVM-Anova and getting the chosen features in sklearn


There is an SVM-Anova example in the sklearn docs. I want to go further and run GridSearchCV over the SVM hyper-parameters, i.e., C and gamma, for every percentile of features used in the example, like this:

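from sklearn import feature_selection, preprocessing, svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# X (feature matrix) and y (class labels) are assumed to be loaded already.
transform = feature_selection.SelectPercentile(feature_selection.f_classif)
clf = Pipeline([('anova', transform),
                ('normal', preprocessing.StandardScaler()),
                ('svc', svm.SVC())])
parameters = {
    'svc__gamma': (1e-3, 1e-4),
    'svc__C': (1, 10, 100, 1000),
}

percentiles = (1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100)
# Current sklearn takes n_splits here; the data is passed later via fit().
cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=5)
for percentile in percentiles:
    clf.set_params(anova__percentile=percentile)
    search = GridSearchCV(clf, parameters, cv=cv, scoring='roc_auc', n_jobs=1)
    search.fit(X, y)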

It works fine. By doing this I can tune the parameters of Anova and SVM simultaneously and use the resulting combination of parameters to build my final model.

However, I am confused about how it works. Does it first split the data and then run each training split through the pipeline? If so, how can I determine which features Anova chose, if I want to gain further insight into those selected features?

Say I get the best CV score using one combination of parameters (percentile for Anova and C/gamma for SVM). How can I find out exactly which features were retained in that setting? Every parameter setting is tested under CV, and each fold has different training data, so Anova may select a different feature set in each fold.

One way I could come up with is to intersect the feature sets retained in each fold for the best-performing combination of parameters, but I don't know how to modify the code to do it.
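One way this per-fold intersection might be sketched is shown below. It relies on the fact that GridSearchCV fits a fresh clone of the pipeline on each training fold, so refitting the pipeline on the same splits reproduces the per-fold selections. The synthetic data from make_classification and the values in best_params are placeholders (assumptions, not results from the real search); in practice they would be the actual X, y and search.best_params_.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); replace with the real X, y.
X, y = make_classification(n_samples=140, n_features=50,
                           n_informative=5, random_state=0)

# Hypothetical "best" setting; in practice use search.best_params_.
best_params = {'anova__percentile': 20, 'svc__C': 10, 'svc__gamma': 1e-3}

clf = Pipeline([('anova', SelectPercentile(f_classif)),
                ('normal', StandardScaler()),
                ('svc', SVC())])
clf.set_params(**best_params)

# Same splitter settings as the search, so the folds are the same.
cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=5)

fold_masks = []
for train_idx, _ in cv.split(X, y):
    # Fit a fresh copy on the training part of this fold only.
    fold_clf = clone(clf).fit(X[train_idx], y[train_idx])
    fold_masks.append(fold_clf.named_steps['anova'].get_support())

fold_masks = np.array(fold_masks)                 # shape (n_folds, n_features)
common = np.flatnonzero(fold_masks.all(axis=0))   # kept in every fold
counts = fold_masks.sum(axis=0)                   # folds in which each feature was kept

print("features kept in every fold:", common)
print("selection count per feature:", counts)
```

The counts array may be more informative than the strict intersection, since a feature that is dropped in a single fold disappears from common entirely.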

Any suggestions or doubts about the method are appreciated and welcome.


Solution

  • You could get rid of the loop over percentiles and just put the percentiles in the parameter grid. Then you can look at the selected features of search.best_estimator_, that is, search.best_estimator_.named_steps['anova'].get_support()
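
A minimal sketch of this suggestion, assuming a current sklearn version (sklearn.model_selection) and using synthetic data from make_classification as a stand-in for the real X, y. The grid values are shortened for illustration. Note that with the default refit=True, best_estimator_ is refit on the whole data set, so the mask reflects features selected from all of X, y rather than from any single CV fold.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); replace with the real X, y.
X, y = make_classification(n_samples=140, n_features=50,
                           n_informative=5, random_state=0)

clf = Pipeline([('anova', SelectPercentile(f_classif)),
                ('normal', StandardScaler()),
                ('svc', SVC())])

# The percentile is tuned in the same grid as C and gamma: no outer loop.
parameters = {
    'anova__percentile': (10, 30, 60, 100),
    'svc__gamma': (1e-3, 1e-4),
    'svc__C': (1, 10, 100),
}

cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=5)
search = GridSearchCV(clf, parameters, cv=cv, scoring='roc_auc', n_jobs=1)
search.fit(X, y)

# get_support() returns a boolean mask over the input columns.
support = search.best_estimator_.named_steps['anova'].get_support()
selected = support.nonzero()[0]   # column indices of the retained features

print(search.best_params_)
print(selected)
```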