Tags: scikit-learn, random-forest, gridsearchcv, rfe

How to use 'max_features' in Gridsearch when combining with RFECV?


Thank you in advance for answering. This is my first post and I am relatively new to Python, so I apologize if I have formatted something terribly.

I am trying to combine recursive feature elimination and grid search in sklearn to determine the best combination of hyperparameters and number of features. With the code below, I get `max_features must be in (0, n_features]` (`Estimator fit failed.`) for any value of max_features other than 1. I have over 300 features in my dataset, many of them likely unimportant.

    import sklearn.ensemble
    import sklearn.feature_selection
    import sklearn.model_selection

    # the estimator__ prefix routes these parameters to the forest inside RFECV
    param_dist = {'estimator__n_estimators': [i for i in range(11, 121, 10)],
                  'estimator__criterion': ['gini', 'entropy'],
                  'estimator__max_features': [i for i in range(1, 10)]}

    estimator = sklearn.ensemble.RandomForestClassifier(n_jobs=-1, random_state=42,
                                                        bootstrap=True, verbose=True)

    selector = sklearn.feature_selection.RFECV(estimator=estimator, step=1, cv=5,
                                               scoring='accuracy')

    rf_nested = sklearn.model_selection.GridSearchCV(estimator=selector, param_grid=param_dist,
                                                     cv=5, scoring='accuracy', n_jobs=-1,
                                                     refit=True, return_train_score=True)

    rf_nested.fit(X_train, y_train)


Solution

  • I would not mix the feature selection step and the hyperparameter optimization one.

    The thing is that you're passing a selector to Grid Search instead of an estimator. This might end up working, but it's probably not the best idea, as the two are different classes with different methods.

    I am not sure about the internals of the two, but it seems plausible that RFECV returns a small number of features if only a few are important (maybe even just one), while GridSearchCV then asks for a max_features larger than the number of features still available, which is exactly what the error message says.
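
    You can reproduce that failure in isolation, without RFECV in the loop. A minimal sketch with invented shapes (on a scikit-learn version that produces your exact message):

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        X = np.random.rand(100, 5)             # pretend only 5 features survived selection
        y = np.random.randint(0, 2, size=100)

        # An integer max_features larger than the 5 available columns raises
        # "max_features must be in (0, n_features]" on versions with this check.
        RandomForestClassifier(max_features=8).fit(X, y)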

    Also, what you're doing is a cross validation within a cross validation, which seems rather unnecessary. For every candidate that Grid Search evaluates, it runs your selector 5 times, which in turn fits the Random Forest over 5 folds to select the number of features.

    In the end, I think you would be better off separating the two steps: find the most important features first through RFECV, and then tune max_features on the reduced feature set, as in the sketch below.
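
    A minimal sketch of that two-step approach, assuming the same X_train / y_train from your snippet (the specific grid values are just examples):

        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_selection import RFECV
        from sklearn.model_selection import GridSearchCV

        # Step 1: recursive feature elimination with a fixed baseline forest.
        selector = RFECV(estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
                         step=1, cv=5, scoring='accuracy')
        selector.fit(X_train, y_train)
        X_train_sel = selector.transform(X_train)   # keep only the selected columns
        print(f"RFECV kept {selector.n_features_} of {X_train.shape[1]} features")

        # Step 2: tune the forest on the reduced feature set. No estimator__ prefix
        # is needed now, because GridSearchCV wraps the forest directly, and
        # max_features is capped by how many features actually survived.
        param_grid = {'criterion': ['gini', 'entropy'],
                      'max_features': list(range(1, min(10, selector.n_features_ + 1)))}
        search = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=42),
                              param_grid=param_grid, cv=5, scoring='accuracy', refit=True)
        search.fit(X_train_sel, y_train)
        print(search.best_params_)

    Capping the max_features grid at selector.n_features_ is what prevents the error you saw.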

    Final (unrelated) advice: I would not Grid Search on the number of estimators. Random Forests tend not to overfit as trees are added, so the best option is to set up an early-stopping rule while cross validating, along the lines of the sketch below :)
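
    scikit-learn's forests have no built-in early-stopping flag, but one common substitute (a sketch of the general idea, not necessarily the only way) is to grow the forest incrementally with warm_start and watch the out-of-bag score plateau:

        from sklearn.ensemble import RandomForestClassifier

        # Grow the same forest in increments of 10 trees and track OOB accuracy;
        # stop investing in more trees once the score stops improving.
        rf = RandomForestClassifier(warm_start=True, oob_score=True,
                                    n_jobs=-1, random_state=42)
        best_oob, best_n = 0.0, 0
        for n in range(10, 201, 10):
            rf.set_params(n_estimators=n)
            rf.fit(X_train, y_train)             # adds trees, keeps the old ones
            if rf.oob_score_ > best_oob + 1e-4:  # small tolerance against noise
                best_oob, best_n = rf.oob_score_, n
        print(f"OOB score plateaus around n_estimators={best_n} ({best_oob:.3f})")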