This got closed the first time I asked it because this question asks something similar. However, although the answers there show how to add or remove a step from the pipeline, none of them show how this works with GridSearchCV, and I'm left wondering what to do with the pipeline once I've removed the step from it.
I'd like to train a model using a grid search and test the performance both when PCA is performed first and when PCA is omitted. Is there a way to do this? I'm looking for more than simply setting n_components to the number of input variables.
Currently I define my pipeline like this:
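from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pca = PCA()
gbc = GradientBoostingClassifier()
steps = [('pca', pca), ('gbc', gbc)]
pipeline = Pipeline(steps=steps)
param_grid = {
    'pca__n_components': [3, 5, 7],
    'gbc__n_estimators': [50, 100]
}
search = GridSearchCV(pipeline, param_grid, n_jobs=-1, cv=5, scoring='roc_auc')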
For this, you can have a look at the user guide where it says under the paragraph for nested parameters:
"Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to 'passthrough'"
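As a minimal sketch of what that sentence means outside a grid search, a step can be replaced by name with set_params. The pipeline below mirrors the one in the question; the variable name skipped is only there for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[('pca', PCA()), ('gbc', GradientBoostingClassifier())])

# Replace the 'pca' step by its name. The step keeps its slot in the
# pipeline, but data now passes through it unchanged.
pipeline.set_params(pca='passthrough')

# 'skipped' is an illustrative name; it holds whatever now sits in the 'pca' slot.
skipped = pipeline.named_steps['pca']
print(skipped)
```

GridSearchCV applies each candidate to a clone of the pipeline through this same parameter mechanism, which is why the step name can be used as a key in the grid.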
In your case, I would define a grid with a list of two dictionaries, one in case the whole pipeline is used, and one where the PCA is omitted:
param_grid = [
    {
        'pca__n_components': [3, 5, 7],
        'gbc__n_estimators': [50, 100]
    },
    {
        'pca': ['passthrough'],  # skip the PCA
        'gbc__n_estimators': [50, 100]
    }
]
GridSearchCV will now explore the grid spanned by each dictionary in the list, trying combinations both with and without PCA.
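For reference, here is a hedged end-to-end sketch of the approach. The data is a synthetic stand-in from make_classification (an assumption, since the question does not show its dataset), and n_jobs is left at its default to keep the example simple.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 200 samples with 10 input features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipeline = Pipeline(steps=[
    ('pca', PCA()),
    ('gbc', GradientBoostingClassifier(random_state=0)),
])

param_grid = [
    # Full pipeline: PCA followed by the classifier.
    {'pca__n_components': [3, 5, 7], 'gbc__n_estimators': [50, 100]},
    # Same classifier settings with the PCA step skipped.
    {'pca': ['passthrough'], 'gbc__n_estimators': [50, 100]},
]

search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
search.fit(X, y)

# One entry per candidate: 3 * 2 with PCA plus 1 * 2 without.
candidates = search.cv_results_['params']
scores = search.cv_results_['mean_test_score']
for params, score in zip(candidates, scores):
    if params.get('pca') == 'passthrough':
        label = 'no PCA'
    else:
        label = f"PCA({params['pca__n_components']})"
    print(label, params['gbc__n_estimators'], round(score, 3))

print(search.best_params_)
```

If the best candidate is one without PCA, best_params_ will contain 'pca': 'passthrough', so the two variants can be compared directly from cv_results_.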