Tags: python, machine-learning, scikit-learn, data-science, grid-search

How to perform feature selection with GridSearchCV in sklearn in Python


I am using recursive feature elimination with cross-validation (RFECV) as the feature selector for a RandomForestClassifier, as follows.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X = df[my_features] #all my features
y = df['gold_standard'] #labels

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)
features = list(X.columns[rfecv.support_])

I am also performing GridSearchCV as follows to tune the hyperparameters of the RandomForestClassifier.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score

X = df[my_features] #all my features
y = df['gold_standard'] #labels

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfc = RandomForestClassifier(random_state=42, class_weight='balanced')
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=k_fold, scoring='roc_auc')
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(CV_rfc.best_estimator_)

pred = CV_rfc.predict_proba(x_test)[:, 1]
print(roc_auc_score(y_test, pred))

However, I am not clear on how to combine feature selection (RFECV) with GridSearchCV.

EDIT:

When I ran the answer suggested by @Gambit, I got the following error:

ValueError: Invalid parameter criterion for estimator RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False),
   estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators='warn', n_jobs=None, oob_score=False,
            random_state=42, verbose=0, warm_start=False),
   min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
   verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.

I could resolve the above issue by using the `estimator__` prefix in the param_grid parameter names.
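For anyone hitting the same error, a minimal sketch of that fix looks like the following (reusing `rfecv`, `k_fold`, and the training split from above). scikit-learn's double-underscore convention routes each prefixed parameter to the estimator wrapped inside RFECV:

# Grid search over RFECV: parameters of the inner RandomForestClassifier
# must be addressed through the wrapper with the estimator__ prefix.
param_grid = {
    'estimator__n_estimators': [200, 500],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}
CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid,
                      cv=k_fold, scoring='roc_auc')
CV_rfc.fit(x_train, y_train)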


My question now is: how do I use the selected features and parameters on x_test to verify that the model works well on unseen data? How can I obtain the best features and train the model with the optimal hyperparameters?

I am happy to provide more details if needed.


Solution

  • Basically, you want to fine-tune the hyperparameters of your classifier (with cross-validation) after feature selection using recursive feature elimination (with cross-validation).

    The Pipeline object is meant exactly for this purpose: it assembles the data transformation(s) and the final estimator into a single estimator.

    You could even use a different model (GradientBoostingClassifier, etc.) for your final classification. That is possible with the following approach:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size=0.33, 
                                                        random_state=42)
    
    # this is the classifier used for feature selection
    clf_featr_sele = RandomForestClassifier(n_estimators=30, 
                                            random_state=42,
                                            class_weight="balanced") 
    rfecv = RFECV(estimator=clf_featr_sele, 
                  step=1, 
                  cv=5, 
                  scoring='roc_auc')
    
    # you can have a different classifier for your final classifier
    clf = RandomForestClassifier(n_estimators=10, 
                                 random_state=42,
                                 class_weight="balanced") 
    CV_rfc = GridSearchCV(clf, 
                          param_grid={'max_depth': [2, 3]},
                          cv=5, scoring='roc_auc')
    
    pipeline = Pipeline([('feature_sele', rfecv),
                         ('clf_cv', CV_rfc)])
    
    pipeline.fit(X_train, y_train)
    pipeline.predict(X_test)
    

    Now you can apply this pipeline (including feature selection) to the test data.
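
    To answer the edit in the question, here is a small sketch (using `roc_auc_score` and the step names `feature_sele` and `clf_cv` defined above) of how you might score the fitted pipeline on the held-out test data and retrieve the selected features and tuned hyperparameters:

    from sklearn.metrics import roc_auc_score
    
    # Evaluate the full pipeline (feature selection + tuned classifier)
    # on unseen data.
    pred = pipeline.predict_proba(X_test)[:, 1]
    print("Test ROC AUC: %.3f" % roc_auc_score(y_test, pred))
    
    # Inspect what was learned: the boolean mask of features kept by
    # RFECV and the best hyperparameters found by the inner grid search.
    print(pipeline.named_steps['feature_sele'].support_)
    print(pipeline.named_steps['clf_cv'].best_params_)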