Tags: python, machine-learning, scikit-learn, pipeline

Is it possible to optimize hyperparameters for optional sklearn pipeline steps?


I constructed a pipeline that has some optional steps, and I would like to optimize hyperparameters for those steps: I want to pick the best option between not using them at all and using them with different configurations (in my case SelectFromModel - sfm).

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = RandomForestRegressor(random_state=1)
stdscl = StandardScaler()
sfm = SelectFromModel(RandomForestRegressor(random_state=1))

p_grid_lr = {"clf__max_depth": [10, 50, 100, None],
             "clf__n_estimators": [10, 50, 100, 200, 500, 800],
             "clf__max_features":[0.1, 0.5, 1.0,'sqrt','log2'],
             "sfm": ['passthrough', sfm],
             "sfm__max_depth": [10, 50, 100, None],
             "sfm__n_estimators": [10, 50, 100, 200, 500, 800],
             "sfm__max_features":[0.1, 0.5, 1.0,'sqrt','log2'],
            }

pipeline=Pipeline([
                 ('scl',stdscl),
                 ('sfm',sfm),
                 ('clf',clf)
                  ])

gs_clf = GridSearchCV(estimator=pipeline, param_grid=p_grid_lr,
                      cv=KFold(shuffle=True, n_splits=5, random_state=1),
                      scoring='r2', n_jobs=-1)
gs_clf.fit(X_train, y_train)

clf = gs_clf.best_estimator_

The error that I get is 'string' object has no attribute 'set_params', which is understandable. Is there a way to specify which combinations should be tried together, in my case only 'passthrough' by itself and sfm with different hyperparameters?

Thanks!


Solution

  • As suggested by @Robin, you might define p_grid_lr as a list of dictionaries. Indeed, here is what the GridSearchCV docs state on this point:

    param_grid: dict or list of dictionaries

    Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

    p_grid_lr = [
        {
            "clf__max_depth": [10, 50, 100, None],
            "clf__n_estimators": [10, 50, 100, 200, 500, 800],
            "clf__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
            "sfm__estimator__max_depth": [10, 50, 100, None],
            "sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
            "sfm__estimator__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
        },
        {
            "clf__max_depth": [10, 50, 100, None],
            "clf__n_estimators": [10, 50, 100, 200, 500, 800],
            "clf__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
            "sfm": ['passthrough'],
        }
    ]
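
    To see the list-of-dicts approach end to end, here is a minimal runnable sketch on synthetic data (the grids are deliberately reduced so the search stays fast; values are illustrative, not a recommendation):

    ```python
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import GridSearchCV, KFold
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=100, n_features=10, random_state=1)

    pipeline = Pipeline([
        ('scl', StandardScaler()),
        ('sfm', SelectFromModel(RandomForestRegressor(random_state=1))),
        ('clf', RandomForestRegressor(random_state=1)),
    ])

    p_grid_lr = [
        {   # branch 1: sfm is active, so we tune its inner estimator
            "clf__n_estimators": [10, 50],
            "sfm__estimator__n_estimators": [10, 50],
        },
        {   # branch 2: sfm is skipped entirely
            "clf__n_estimators": [10, 50],
            "sfm": ['passthrough'],
        },
    ]

    gs_clf = GridSearchCV(pipeline, p_grid_lr,
                          cv=KFold(shuffle=True, n_splits=5, random_state=1),
                          scoring='r2', n_jobs=-1)
    gs_clf.fit(X, y)
    print(gs_clf.best_params_)
    ```

    Note how the 'passthrough' dictionary never mentions sfm__estimator__* keys, so GridSearchCV never tries to call set_params on the string.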
    

    A less scalable alternative (for your case) might be the following

    p_grid_lr_ = {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
        "sfm": ['passthrough', 
                SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=10, max_features=0.1)),
                SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=50, max_features=0.1)),
                ...]
    }
    

    specifying all of the possible combinations for your parameters.
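
    If you did go this route, you could at least generate the list programmatically instead of writing each SelectFromModel by hand (a sketch; the parameter values mirror the grid above):

    ```python
    from itertools import product

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SelectFromModel

    depths = [10, 50, 100, None]
    estimators = [10, 50, 100, 200, 500, 800]
    features = [0.1, 0.5, 1.0, 'sqrt', 'log2']

    # 'passthrough' plus one SelectFromModel per parameter combination
    sfm_options = ['passthrough'] + [
        SelectFromModel(RandomForestRegressor(
            random_state=1, max_depth=d, n_estimators=n, max_features=f))
        for d, n, f in product(depths, estimators, features)
    ]
    print(len(sfm_options))  # 1 + 4 * 6 * 5 = 121 candidates for the 'sfm' step
    ```

    This still multiplies the total grid size by 121, which is why the list-of-dicts form is preferable.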

    Moreover, be aware that to access the parameters max_depth, n_estimators and max_features of the RandomForestRegressor estimator nested within SelectFromModel, you should write the parameters as

    "sfm__estimator__max_depth": [10, 50, 100, None],
    "sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
    "sfm__estimator__max_features": [0.1, 0.5, 1.0,'sqrt','log2']
    

    rather than as

    "sfm__max_depth": [10, 50, 100, None],
    "sfm__n_estimators": [10, 50, 100, 200, 500, 800],
    "sfm__max_features": [0.1, 0.5, 1.0,'sqrt','log2']
    

    because these parameters belong to the nested estimator itself (max_features might in principle also be a parameter of SelectFromModel, but in that case it may only take integer values, per the docs).

    In general, you can list all the parameters that can be optimized via pipeline.get_params().keys() (or estimator.get_params().keys() for any estimator).
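
    For instance, this sketch filters those keys to confirm the sfm__estimator__* path used above (the pipeline definition mirrors the one in the question):

    ```python
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SelectFromModel
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipeline = Pipeline([
        ('scl', StandardScaler()),
        ('sfm', SelectFromModel(RandomForestRegressor(random_state=1))),
        ('clf', RandomForestRegressor(random_state=1)),
    ])

    # Parameters of the RandomForestRegressor nested inside SelectFromModel
    sfm_keys = sorted(k for k in pipeline.get_params()
                      if k.startswith('sfm__estimator__'))
    print(sfm_keys)
    ```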

    Finally, here's a nice read from the user guide on Pipelines.