
Grid search on parameters inside the parameters of a BaggingClassifier


This is a follow up on a question answered here, but I believe it deserves its own thread.

In the previous question, we were dealing with “an Ensemble of Ensemble classifiers, where each has its own parameters.” Let's start with the example provided by MaximeKan in his answer:

my_est = BaggingClassifier(
    RandomForestClassifier(n_estimators=100, bootstrap=True, max_features=0.5),
    n_estimators=5, bootstrap_features=False, bootstrap=False,
    max_features=1.0, max_samples=0.6)

Now say I want to go one level above that. Setting aside considerations like efficiency and computational cost, and as a general concept: how would I run a grid search with this kind of setup?

I can set up two parameter grids along these lines:

One for the BaggingClassifier:

BC_param_grid = {
    'bootstrap': [True, False],
    'bootstrap_features': [True, False],
    'n_estimators': [5, 10, 15],
    'max_samples': [0.6, 0.8, 1.0]
}

And one for the RandomForestClassifier:

RFC_param_grid = {
    'bootstrap': [True, False],
    'n_estimators': [100, 200, 300],
    'max_features': [0.6, 0.8, 1.0]
}

Now I can call grid search with my estimator:

grid_search = GridSearchCV(estimator=my_est, param_grid=???)

What do I do with the param_grid parameter in this case? Or more specifically, how do I use both of the parameter grids I set up?

I have to say, it feels like I’m playing with matryoshka dolls.


Solution

  • Following @James Dellinger's comment above, and expanding from there, I was able to get it done. It turns out the "secret sauce" is indeed a mostly undocumented feature: the __ (double underscore) separator (there is some passing reference to it in the Pipeline documentation). Prefixing a parameter of the inside/base estimator with that estimator's name followed by __ lets you build a single param_grid that covers parameters of both the outside and inside estimators.
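
    A quick way to discover the exact double-underscore names your install accepts is get_params(), since every key it returns is a legal param_grid key. A small sketch (note that scikit-learn >= 1.2 names BaggingClassifier's inner-estimator parameter estimator rather than base_estimator, so the prefix you see may differ):

    ```python
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

    # Nested parameters appear with the "<estimator_param_name>__" prefix
    clf = BaggingClassifier(RandomForestClassifier())
    nested = [k for k in clf.get_params() if "__" in k]
    print(sorted(nested))  # e.g. [..., 'base_estimator__n_estimators', ...]
    ```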

    So, for the example in the question, the outside estimator is BaggingClassifier and the inside/base estimator is RandomForestClassifier. First, import what is needed:

    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    

    followed by the param_grid assignment (in this case, the two grids from the question merged into one):

    param_grid = {
        'bootstrap': [True, False],
        'bootstrap_features': [True, False],
        'n_estimators': [5, 10, 15],
        'max_samples': [0.6, 0.8, 1.0],
        'base_estimator__bootstrap': [True, False],
        'base_estimator__n_estimators': [100, 200, 300],
        'base_estimator__max_features': [0.6, 0.8, 1.0]
    }
    
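
    Before running it, it is worth noting how large this merged grid is. A quick back-of-the-envelope count (plain Python, no scikit-learn needed):

    ```python
    from math import prod

    param_grid = {
        'bootstrap': [True, False],
        'bootstrap_features': [True, False],
        'n_estimators': [5, 10, 15],
        'max_samples': [0.6, 0.8, 1.0],
        'base_estimator__bootstrap': [True, False],
        'base_estimator__n_estimators': [100, 200, 300],
        'base_estimator__max_features': [0.6, 0.8, 1.0],
    }

    # Number of parameter combinations the grid search will try
    n_candidates = prod(len(v) for v in param_grid.values())
    print(n_candidates)      # 648 combinations
    print(n_candidates * 5)  # 3240 fits with cv=5
    ```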

    And, finally, your grid search:

    grid_search = GridSearchCV(
        BaggingClassifier(base_estimator=RandomForestClassifier()),
        param_grid=param_grid, cv=5)

    And you're off to the races.
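
    To sanity-check the whole idea end to end, here is a minimal sketch on synthetic data with a deliberately tiny grid (dataset and grid values are illustrative only). Since scikit-learn 1.2 renamed BaggingClassifier's base_estimator parameter to estimator, the sketch detects which prefix this install uses:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Small synthetic dataset, for illustration only
    X, y = make_classification(n_samples=120, n_features=8, random_state=0)

    bc = BaggingClassifier(RandomForestClassifier(n_estimators=10))
    # scikit-learn >= 1.2 names the inner-estimator parameter "estimator";
    # older versions name it "base_estimator". Use whichever exists here.
    prefix = "estimator__" if "estimator" in bc.get_params() else "base_estimator__"

    param_grid = {
        "n_estimators": [2, 3],            # outer: number of bagged forests
        prefix + "n_estimators": [5, 10],  # inner: trees per forest
    }

    grid_search = GridSearchCV(bc, param_grid=param_grid, cv=3)
    grid_search.fit(X, y)
    print(grid_search.best_params_)
    ```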