machine-learning scikit-learn pipeline grid-search hyperparameters

Is there a way for an sklearn Pipeline to train with and without a step during a grid search? I can remove steps, but how do I pass this to GridSearchCV?

This was closed the first time I asked it because an existing question asks something similar. However, although the answers there show how to add or remove a step from the pipeline, none of them shows how this works with GridSearchCV, and I'm left wondering what to do with the pipeline after I've removed the step.

I'd like to train a model using a grid search and test the performance both when PCA is performed first and when PCA is omitted. Is there a way to do this? I'm looking for more than simply setting n_components to the number of input variables.

Currently I define my pipeline like this:

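from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pca = PCA()
gbc = GradientBoostingClassifier()
steps = [('pca', pca), ('gbc', gbc)]
pipeline = Pipeline(steps=steps)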

param_grid = {
    'pca__n_components': [3, 5, 7],
    'gbc__n_estimators': [50, 100]
    }

search = GridSearchCV(pipeline, param_grid, n_jobs=-1, cv=5, scoring='roc_auc')

Solution

The scikit-learn user guide covers this in its paragraph on nested parameters, where it says:

Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to 'passthrough'
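To see what 'passthrough' does outside of a grid search, here is a minimal sketch that sets it directly on a pipeline with set_params. The synthetic dataset from make_classification and the small n_estimators value are assumptions chosen only to keep the example quick; they are not part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

# Synthetic data with 10 input features (an assumption for illustration).
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

pipeline = Pipeline(steps=[
    ('pca', PCA(n_components=3)),
    ('gbc', GradientBoostingClassifier(n_estimators=10, random_state=0)),
])

# Replace the whole 'pca' step with the string 'passthrough' so it is skipped.
pipeline.set_params(pca='passthrough')
pipeline.fit(X, y)

# With PCA skipped, the classifier is fitted on all original features.
n_features_seen = pipeline.named_steps['gbc'].n_features_in_
print(n_features_seen)
```

The key point is that the step name itself ('pca') is a settable parameter, which is what lets a parameter grid swap the step out.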

In your case, I would define a grid with a list of two dictionaries, one in case the whole pipeline is used, and one where the PCA is omitted:

param_grid = [
    {
        'pca__n_components': [3, 5, 7],
        'gbc__n_estimators': [50, 100]
    },
    {
        'pca': ['passthrough'], # skip the PCA
        'gbc__n_estimators': [50, 100]
    }
]

GridSearchCV now expands each dictionary in the list into its own grid and evaluates all of them: six combinations with PCA (3 values of n_components × 2 values of n_estimators) and two combinations without it.
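As a rough end-to-end sketch, the list-of-dictionaries grid can be run and inspected as follows. The synthetic data and the random_state values are assumptions for illustration, and n_jobs is left at its default here; which candidate wins depends entirely on your data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipeline = Pipeline(steps=[
    ('pca', PCA()),
    ('gbc', GradientBoostingClassifier(random_state=0)),
])

param_grid = [
    # Full pipeline: PCA followed by the classifier.
    {'pca__n_components': [3, 5, 7], 'gbc__n_estimators': [50, 100]},
    # PCA skipped: the classifier sees the raw features.
    {'pca': ['passthrough'], 'gbc__n_estimators': [50, 100]},
]

search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
search.fit(X, y)

# One entry per candidate across both dictionaries.
n_candidates = len(search.cv_results_['params'])
# The best candidate skipped PCA if its parameters set 'pca' to 'passthrough'.
best_skips_pca = search.best_params_.get('pca') == 'passthrough'

print(n_candidates)
print(search.best_params_, best_skips_pca)
```

Candidates from the second dictionary show up in search.cv_results_['params'] with a 'pca' key set to 'passthrough', so you can compare their mean_test_score against the PCA candidates directly.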