Tags: python, machine-learning, scikit-learn, pipeline

Is it possible to optimize hyperparameters for optional sklearn pipeline steps?


I constructed a pipeline that has some optional steps, and I would like to optimize hyperparameters for those steps: I want to pick the best option between not using them at all and using them with different configurations (in my case SelectFromModel - sfm).

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = RandomForestRegressor(random_state=1)
stdscl = StandardScaler()
sfm = SelectFromModel(RandomForestRegressor(random_state=1))

p_grid_lr = {"clf__max_depth": [10, 50, 100, None],
             "clf__n_estimators": [10, 50, 100, 200, 500, 800],
             "clf__max_features":[0.1, 0.5, 1.0,'sqrt','log2'],
             "sfm": ['passthrough', sfm],
             "sfm__max_depth": [10, 50, 100, None],
             "sfm__n_estimators": [10, 50, 100, 200, 500, 800],
             "sfm__max_features":[0.1, 0.5, 1.0,'sqrt','log2'],
            }

pipeline=Pipeline([
                 ('scl',stdscl),
                 ('sfm',sfm),
                 ('clf',clf)
                  ])

gs_clf = GridSearchCV(estimator=pipeline, param_grid=p_grid_lr,
                      cv=KFold(shuffle=True, n_splits=5, random_state=1),
                      scoring='r2', n_jobs=-1)
gs_clf.fit(X_train, y_train)

clf = gs_clf.best_estimator_

The error that I get is 'string' object has no attribute 'set_params', which is understandable. Is there a way to specify which combinations should be tried together, in my case only 'passthrough' by itself and sfm with different hyperparameters?

Thanks!


Solution

  • As suggested by @Robin, you might define p_grid_lr as a list of dictionaries. Indeed, here is what the GridSearchCV docs state on this point:

    param_grid: dict or list of dictionaries

    Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

    p_grid_lr = [
        {
            "clf__max_depth": [10, 50, 100, None],
            "clf__n_estimators": [10, 50, 100, 200, 500, 800],
            "clf__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
            "sfm__estimator__max_depth": [10, 50, 100, None],
            "sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
            "sfm__estimator__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
        },
        {
            "clf__max_depth": [10, 50, 100, None],
            "clf__n_estimators": [10, 50, 100, 200, 500, 800],
            "clf__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
            "sfm": ['passthrough'],
        }
    ]
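
    To see the list-of-dicts approach end to end, here is a minimal runnable sketch on synthetic data (the grids are deliberately reduced so the search stays fast; values are illustrative, not a recommendation):

    ```python
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import GridSearchCV, KFold
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=100, n_features=10, random_state=1)

    pipeline = Pipeline([
        ('scl', StandardScaler()),
        ('sfm', SelectFromModel(RandomForestRegressor(random_state=1))),
        ('clf', RandomForestRegressor(random_state=1)),
    ])

    p_grid_lr = [
        {   # branch 1: sfm is active, so we tune its inner estimator
            "clf__n_estimators": [10, 50],
            "sfm__estimator__n_estimators": [10, 50],
        },
        {   # branch 2: sfm is skipped entirely
            "clf__n_estimators": [10, 50],
            "sfm": ['passthrough'],
        },
    ]

    gs_clf = GridSearchCV(pipeline, p_grid_lr,
                          cv=KFold(shuffle=True, n_splits=5, random_state=1),
                          scoring='r2', n_jobs=-1)
    gs_clf.fit(X, y)
    print(gs_clf.best_params_)
    ```

    Note how the 'passthrough' dictionary never mentions sfm__estimator__* keys, so GridSearchCV never tries to call set_params on the string.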
    

    A less scalable alternative (for your case) might be the following

    p_grid_lr_ = {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0,'sqrt','log2'],
        "sfm": ['passthrough', 
                SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=10, max_features=0.1)),
                SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=50, max_features=0.1)),
                ...]
    }
    

    specifying all of the possible combinations for your parameters.
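
    If you did go this route, you could at least generate the list programmatically instead of writing each SelectFromModel by hand (a sketch; the parameter values mirror the grid above):

    ```python
    from itertools import product

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SelectFromModel

    depths = [10, 50, 100, None]
    estimators = [10, 50, 100, 200, 500, 800]
    features = [0.1, 0.5, 1.0, 'sqrt', 'log2']

    # 'passthrough' plus one SelectFromModel per parameter combination
    sfm_options = ['passthrough'] + [
        SelectFromModel(RandomForestRegressor(
            random_state=1, max_depth=d, n_estimators=n, max_features=f))
        for d, n, f in product(depths, estimators, features)
    ]
    print(len(sfm_options))  # 1 + 4 * 6 * 5 = 121 candidates for the 'sfm' step
    ```

    This still multiplies the total grid size by 121, which is why the list-of-dicts form is preferable.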

    Moreover, be aware that to access the parameters max_depth, n_estimators and max_features of the RandomForestRegressor estimator nested within SelectFromModel, you should write the parameters as

    "sfm__estimator__max_depth": [10, 50, 100, None],
    "sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
    "sfm__estimator__max_features": [0.1, 0.5, 1.0,'sqrt','log2']
    

    rather than as

    "sfm__max_depth": [10, 50, 100, None],
    "sfm__n_estimators": [10, 50, 100, 200, 500, 800],
    "sfm__max_features": [0.1, 0.5, 1.0,'sqrt','log2']
    

    because these parameters belong to the nested estimator itself (max_features might in principle also be a parameter of SelectFromModel, but in that case it may only take integer values, per the docs).

    In general, you can list all the parameters that can be optimized via pipeline.get_params().keys() (or estimator.get_params().keys() for any estimator).
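
    For instance, this sketch filters those keys to confirm the sfm__estimator__* path used above (the pipeline definition mirrors the one in the question):

    ```python
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SelectFromModel
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipeline = Pipeline([
        ('scl', StandardScaler()),
        ('sfm', SelectFromModel(RandomForestRegressor(random_state=1))),
        ('clf', RandomForestRegressor(random_state=1)),
    ])

    # Parameters of the RandomForestRegressor nested inside SelectFromModel
    sfm_keys = sorted(k for k in pipeline.get_params()
                      if k.startswith('sfm__estimator__'))
    print(sfm_keys)
    ```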

    Finally, here's a nice read from the user guide on Pipelines.