Tags: scikit-learn, svm, pipeline, grid-search

Using Pipeline with GridSearchCV


Suppose I have this Pipeline object:

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# my_transform is a custom transformer class defined elsewhere
pipe = Pipeline([
    ('my_transform', my_transform()),
    ('estimator', SVC())
])

To pass the hyperparameters to my Support Vector Classifier (SVC) I could do something like this:

pipe_parameters = {
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ['rbf']
}
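
As a rough illustration of the naming convention used above: each key is the pipeline step name, a double underscore, then the parameter name on that step. The helper `split_param_key` below is hypothetical (it is not part of scikit-learn) and only sketches how such a key can be read, assuming step names themselves contain no double underscore.

```python
def split_param_key(key):
    """Split 'step__param' into (step_name, param_name).

    Hypothetical helper for illustration only; it mimics the naming
    convention and is not scikit-learn code.
    """
    step_name, delimiter, param_name = key.partition('__')
    if not delimiter:
        raise ValueError(f"{key!r} has no '__' separator")
    return step_name, param_name


pipe_parameters = {
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ['rbf'],
}

# Map every grid key to the step it targets and the parameter it sets.
routed = {key: split_param_key(key) for key in pipe_parameters}
for key, (step, param) in routed.items():
    print(f"{key} -> step {step!r}, parameter {param!r}")
```

Read this way, 'estimator__gamma' sets gamma on the step named 'estimator', which is the SVC in the pipeline above.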

Then, I could use GridSearchCV:

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, pipe_parameters)
grid.fit(X_train, y_train)

We know that a linear kernel does not use gamma as a hyperparameter. So, how could I include the linear kernel in this GridSearch?

For example, in a plain grid search (without a Pipeline) I could do:

param_grid = [
    {'C': [0.1, 1, 10, 100, 1000],
     'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
     'kernel': ['rbf']},
    {'C': [0.1, 1, 10, 100, 1000],
     'kernel': ['linear']},
    {'C': [0.1, 1, 10, 100, 1000],
     'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
     'degree': [2, 3],
     'kernel': ['poly']}
]
grid = GridSearchCV(SVC(), param_grid)

Therefore, I need a working version of this sort of code:

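# Does not work: a dict cannot hold the same key twice, so the second
# 'estimator__kernel' entry silently overwrites the first.
pipe_parameters = {
    'bag_of_words__max_features': (None, 1500),
    'estimator__kernel': ['rbf'],
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ['linear'],
    'estimator__C': (0.1, 1),
}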

That is, I want to search over the following hyperparameter combinations:

kernel = rbf, gamma = 0.1
kernel = rbf, gamma = 1
kernel = linear, C = 0.1
kernel = linear, C = 1

Solution

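  • You are almost there. Just as you passed a list of dictionaries to GridSearchCV for the plain SVC model, pass a list of dictionaries for the pipeline, using the step-prefixed parameter names. Each dictionary is expanded into its own grid, so gamma is searched only together with the rbf kernel.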

    Try this example:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import SVC
    
    categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space',
    ]
    remove = ('headers', 'footers', 'quotes')
    
    data_train = fetch_20newsgroups(subset='train', categories=categories,
                                    shuffle=True, random_state=42,
                                    remove=remove)
    
    pipe = Pipeline([
        ('bag_of_words', CountVectorizer()),
        ('estimator', SVC())])
    pipe_parameters = [
        {'bag_of_words__max_features': (None, 1500),
         'estimator__C': [0.1],
         'estimator__gamma': [0.0001, 1],
         'estimator__kernel': ['rbf']},
        {'bag_of_words__max_features': (None, 1500),
         'estimator__C': [0.1, 1],
         'estimator__kernel': ['linear']}
    ]
    from sklearn.model_selection import GridSearchCV
    grid = GridSearchCV(pipe, pipe_parameters, cv=2)
    grid.fit(data_train.data, data_train.target)
    
    grid.best_params_
    # {'bag_of_words__max_features': None,
    #  'estimator__C': 0.1,
    #  'estimator__kernel': 'linear'}
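
To check which candidates a list of dictionaries produces, the expansion can be sketched with the standard library alone. scikit-learn provides this as `sklearn.model_selection.ParameterGrid`; the function `expand_grid` below is a hypothetical stand-in written for illustration, not the library's implementation. It assumes each dictionary is expanded independently as a Cartesian product and that the results are then concatenated.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one {name: value} dict per candidate setting.

    Illustrative stand-in: each dictionary in the list is expanded on its
    own, so a parameter that is absent from a dictionary never appears in
    the candidates generated from that dictionary.
    """
    for sub_grid in param_grid:
        names = sorted(sub_grid)
        for values in product(*(sub_grid[name] for name in names)):
            yield dict(zip(names, values))


pipe_parameters = [
    {'bag_of_words__max_features': (None, 1500),
     'estimator__C': [0.1],
     'estimator__gamma': [0.0001, 1],
     'estimator__kernel': ['rbf']},
    {'bag_of_words__max_features': (None, 1500),
     'estimator__C': [0.1, 1],
     'estimator__kernel': ['linear']},
]

candidates = list(expand_grid(pipe_parameters))
print(len(candidates))  # 4 rbf candidates + 4 linear candidates
for params in candidates:
    print(params)
```

None of the linear-kernel candidates carries an 'estimator__gamma' entry, so no fits are spent on gamma values the linear kernel would ignore.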