Thank you in advance for any answers. This is my first post and I am relatively new to Python, so I apologize if I have formatted anything badly.
I am trying to combine recursive feature elimination and grid search in sklearn to determine the best combination of hyperparameters and number of features. With the code below, I get `max_features must be in (0, n_features]` / `Estimator fit failed` for any value of `estimator__max_features` other than 1. I have over 300 features in my dataset, many of them likely unimportant.
```python
import sklearn.ensemble
import sklearn.feature_selection
import sklearn.model_selection

# all three search dimensions belong in one dict (a stray brace had split it)
param_dist = {'estimator__n_estimators': [i for i in range(11, 121, 10)],
              'estimator__criterion': ['gini', 'entropy'],
              'estimator__max_features': [i for i in range(1, 10)]}

# max_features comes from the grid above, so it is not set here
estimator = sklearn.ensemble.RandomForestClassifier(
    n_jobs=-1, random_state=42, bootstrap=True, verbose=True)
selector = sklearn.feature_selection.RFECV(estimator=estimator, step=1, cv=5,
                                           scoring='accuracy')
rf_nested = sklearn.model_selection.GridSearchCV(
    estimator=selector, param_grid=param_dist, cv=5, scoring='accuracy',
    n_jobs=-1, refit=True, return_train_score=True)
rf_nested.fit(X_train, y_train)
```
I would not mix the feature selection step with the hyperparameter optimization step.
The core issue is that you're passing a selector to Grid Search instead of an estimator. This can work, since RFECV exposes an estimator-like interface, but it's probably not the best idea: the two are different classes with different methods and parameters.
As for the error itself: RFECV with step=1 and the default min_features_to_select=1 evaluates feature subsets all the way down to a single feature. As soon as fewer features remain than the max_features the inner forest is told to sample at each split, that fit fails, which is why only max_features=1 survives the whole elimination path.
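Here is a minimal sketch of that failure mode (toy data, illustrative sizes; the exact error text depends on your sklearn version):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X = np.random.rand(60, 12)            # 12 toy features
y = np.random.randint(0, 2, size=60)

# max_features=9 is fine while 12 features remain, but RFECV eliminates
# down to 1 feature, so the fit on the 8-feature subset already fails
# (older sklearn raises: max_features must be in (0, n_features]).
rf = RandomForestClassifier(max_features=9, random_state=42)
RFECV(estimator=rf, step=1, cv=3).fit(X, y)
```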
Also, what you're doing is a cross validation within a cross validation, which seems rather unnecessary: for every parameter combination, Grid Search fits your selector on 5 folds, and each of those RFECV fits cross-validates 5 more times, refitting the Random Forest at every elimination step. With 300+ features that is hundreds of forest fits per candidate.
In the end, I think you would be better off separating the two steps: first find the most important features through RFECV, then grid search the remaining hyperparameters (including max_features) on the reduced feature set, as in the sketch below.
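Something along these lines (a rough sketch; I'm assuming your X_train / y_train from above and arbitrary fixed settings for the selection forest):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV

# Step 1: select features once, with a fixed, reasonable forest
base_rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
selector = RFECV(estimator=base_rf, step=1, cv=5, scoring='accuracy')
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)  # keep only the selected columns

# Step 2: tune the forest on the reduced matrix; capping max_features at
# selector.n_features_ avoids the error from the question
param_grid = {'criterion': ['gini', 'entropy'],
              'max_features': list(range(1, min(10, selector.n_features_) + 1))}
search = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=42),
                      param_grid=param_grid, cv=5, scoring='accuracy',
                      refit=True, return_train_score=True)
search.fit(X_train_sel, y_train)
```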
Final (unrelated) advice: I would not Grid Search on the number of estimators. Random Forests tend not to overfit as you add trees, so the best option is to set up a form of early stopping while cross validating :)
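sklearn's forests don't expose an early-stopping parameter directly, but you can approximate the idea by growing the forest incrementally with warm_start and stopping once the out-of-bag score plateaus. A rough sketch (step size, tolerance, and patience are arbitrary choices):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(warm_start=True, oob_score=True, bootstrap=True,
                            n_jobs=-1, random_state=42)
best_oob, stale_rounds = -1.0, 0
for n in range(20, 401, 20):
    rf.set_params(n_estimators=n)
    rf.fit(X_train, y_train)              # warm_start: only adds new trees
    if rf.oob_score_ > best_oob + 1e-4:   # meaningful improvement
        best_oob, stale_rounds = rf.oob_score_, 0
    else:
        stale_rounds += 1
        if stale_rounds >= 3:             # no gain for 3 rounds: stop early
            break
```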