
How do I make a Pipeline skip a step (using "passthrough") along with all the params in param_grid that apply to that step?


I'm creating a pipeline in sklearn with a PCA step, and I also want to try skipping that step using "passthrough". For PCA I'm testing several values of the n_components parameter.

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_regression(n_samples=100, n_features=10)


param_grid = {
    'reduce_dim': [PCA(), 'passthrough'],
    'reduce_dim__n_components': [1,2,3]
}

pipeline = Pipeline(
        steps=[
            ('reduce_dim', None), 
            ('regressor', LinearRegression())
        ]
    )

grid_search = GridSearchCV(
    estimator=pipeline, 
    param_grid=param_grid, 
    verbose=10
)
grid_search.fit(X_train, y_train)

What I want to achieve is 3 candidates with PCA (n_components of 1, 2, and 3) and 1 candidate without PCA:

Fitting 5 folds for each of 4 candidates, totalling 20 fits

What I get is 3 candidates with PCA and 3 candidates without PCA (there is no point in trying all three values of n_components when PCA is skipped):

Fitting 5 folds for each of 6 candidates, totalling 30 fits

and then a runtime error, which basically says that n_components cannot be set on "passthrough" (a str object):

[CV 1/5; 4/6] START reduce_dim=passthrough, reduce_dim__n_components=1...
AttributeError: 'str' object has no attribute 'set_params'

How do I make the pipeline skip the step (reduce_dim in this case) and all the params that apply to that step?

I know that I can use param_grid like this:

param_grid = [
    {
        'reduce_dim': [PCA()],
        'reduce_dim__n_components': [1,2,3]
    },
    {}
]

but can it be done in a more elegant way? In more complex scenarios the code gets really messy.


Solution

  • The parameter grid you want can also be defined in a single dictionary with a single key, by listing pre-configured PCA instances alongside "passthrough":

    param_grid = {
        'reduce_dim': [PCA(n_components=1), PCA(n_components=2), PCA(n_components=3), 'passthrough']
    }
    

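    This avoids having to define several dictionaries, which may make the grid less "messy". Because n_components is set on each PCA instance rather than through reduce_dim__n_components, it is never applied to the "passthrough" string.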
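
As a rough end-to-end check of the single-dictionary approach, the sketch below reuses the question's setup and counts the candidates that the grid expands to. A few details are assumptions rather than part of the original: the list comprehension over a hypothetical `n_components_options` list, the `random_state=0` seed, the explicit `cv=5`, and the use of "passthrough" instead of None as the placeholder step.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline

# Same synthetic data as in the question; the seed is an added assumption
# so that repeated runs see the same data.
X_train, y_train = make_regression(n_samples=100, n_features=10, random_state=0)

# Hypothetical helper list: build the PCA candidates programmatically
# instead of writing each instance out by hand.
n_components_options = [1, 2, 3]
param_grid = {
    "reduce_dim": [PCA(n_components=n) for n in n_components_options]
    + ["passthrough"],
}

# ParameterGrid enumerates the combinations a grid describes, so it can be
# used to count candidates before running any fits.
n_candidates = len(ParameterGrid(param_grid))

pipeline = Pipeline(
    steps=[
        ("reduce_dim", "passthrough"),  # placeholder, overridden by the grid
        ("regressor", LinearRegression()),
    ]
)

grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(n_candidates)
print(grid_search.best_params_)
```

With three PCA instances plus "passthrough", the grid should contain the 4 candidates the question asks for, and no reduce_dim__n_components key is ever passed to the string step. If other PCA parameters need tuning too, the same pattern extends by adding more pre-configured instances to the list, at the cost of a longer list.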