I'm building a pipeline in sklearn with a PCA step, and I also want to try skipping that step using "passthrough".
For PCA I'm testing several values of the n_components parameter.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X_train, y_train = make_regression(n_samples=100, n_features=10)
param_grid = {
    'reduce_dim': [PCA(), 'passthrough'],
    'reduce_dim__n_components': [1, 2, 3]
}
pipeline = Pipeline(
    steps=[
        ('reduce_dim', None),
        ('regressor', LinearRegression())
    ]
)
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    verbose=10
)
grid_search.fit(X_train, y_train)
What I want to achieve is 3 fits for PCA with n_components=[1,2,3]
and 1 fit without PCA.
Fitting 5 folds for each of 4 candidates, totalling 20 fits
What I get is 3 fits for PCA and 3 fits without PCA (I don't need to test all three possibilities of n_components
without PCA):
Fitting 5 folds for each of 6 candidates, totalling 30 fits
and then a runtime error which basically says that I cannot assign n_components value to "passthrough" (str object)
[CV 1/5; 4/6] START reduce_dim=passthrough, reduce_dim__n_components=1...
AttributeError: 'str' object has no attribute 'set_params'
How do I make the pipeline skip the step (reduce_dim in this case) and all params that apply to that step?
I know that I can use param_grid like this:
param_grid = [
    {
        'reduce_dim': [PCA()],
        'reduce_dim__n_components': [1, 2, 3]
    },
    {}
]
but can it be done in a more elegant way? In more complex scenarios this code gets really messy.
The parameter grid you want can also be defined as a single dictionary with a single key:
param_grid = {
    'reduce_dim': [PCA(n_components=1), PCA(n_components=2), PCA(n_components=3), 'passthrough']
}
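Each PCA instance carries its own n_components, so no reduce_dim__n_components key is needed and the grid expands to exactly 4 candidates. This avoids having to define several dictionaries, which might be less "messy".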
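As a rough end-to-end sketch of this approach, the list of PCA instances can be built with a comprehension so it scales to more values. The names reduce_dim_options, n_candidates and n_evaluated are only illustrative, random_state=0 is added here for reproducibility, and the pipeline's placeholder step is set to 'passthrough' rather than None as an assumption of convenience. ParameterGrid is used only to count the candidates before fitting.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline

X_train, y_train = make_regression(n_samples=100, n_features=10, random_state=0)

# One PCA instance per n_components value, plus 'passthrough' to skip the step.
reduce_dim_options = [PCA(n_components=n) for n in (1, 2, 3)] + ["passthrough"]
param_grid = {"reduce_dim": reduce_dim_options}

pipeline = Pipeline(
    steps=[
        ("reduce_dim", "passthrough"),  # placeholder, overridden by the grid
        ("regressor", LinearRegression()),
    ]
)

# ParameterGrid enumerates the same combinations GridSearchCV will try.
n_candidates = len(ParameterGrid(param_grid))

grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid)
grid_search.fit(X_train, y_train)

# Number of parameter settings that were actually evaluated.
n_evaluated = len(grid_search.cv_results_["params"])
print(n_candidates, n_evaluated)
```

With the default 5-fold cross-validation this should give 4 candidates and 20 fits, matching what the question asks for. If several steps each need this treatment, the same comprehension pattern can be repeated per step inside the one dictionary.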