
How to use GridSearchCV to compare multiple models, with a pipeline and hyperparameter tuning, in Python


I am using two estimators, a random forest and an SVM:

random_forest_pipeline=Pipeline([   
    ('vectorizer',CountVectorizer(stop_words='english')),
    ('random_forest',RandomForestClassifier())
])
svm_pipeline=Pipeline([
    ('vectorizer',CountVectorizer(stop_words='english')),
    ('svm',LinearSVC())
])

I want to vectorize the data first and then apply the estimator (I was following an online tutorial). I then define the hyperparameters as follows:

parameters=[
    {
        'vectorizer__max_features':[500,1000,1500],
        'random_forest__min_samples_split':[50,100,250,500]
    },
    {
        'vectorizer__max_features':[500,1000,1500],
        'svm__C':[1,3,5]
    }
]

and pass them to GridSearchCV:

pipelines=[random_forest_pipeline,svm_pipeline]
grid_search=GridSearchCV(pipelines,param_grid=parameters,cv=3,n_jobs=-1)
grid_search.fit(x_train,y_train)

But when I run the code, I get an error:

TypeError: estimator should be an estimator implementing 'fit' method

I don't know why I am getting this error.


Solution

  • The problem is pipelines=[random_forest_pipeline,svm_pipeline]: it is a plain Python list, and a list has no fit method, so GridSearchCV does not accept it as an estimator.

    Even if you could make it work this way, at some point parameters meant for one pipeline would be passed to the other, for example 'random_forest__min_samples_split':[50,100,250,500] to svm_pipeline or 'svm__C':[1,3,5] to random_forest_pipeline, and this would raise an error such as:

    ValueError: Invalid parameter svm for estimator Pipeline

    You cannot mix two pipelines this way, because each pipeline only accepts parameters that belong to its own steps. Asking svm_pipeline to be evaluated with values of random_forest__min_samples_split is invalid.
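Both points above can be checked directly. The following is a minimal sketch, not the asker's exact setup: the vectorizer step is left out to keep it short, and the variable names (rf_pipeline, list_has_fit, invalid_param_error) are made up for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Simplified pipelines (no vectorizer), for illustration only.
rf_pipeline = Pipeline([('random_forest', RandomForestClassifier())])
svm_pipeline = Pipeline([('svm', LinearSVC())])
pipelines = [rf_pipeline, svm_pipeline]

# Each pipeline is an estimator, but the list that holds them is not.
list_has_fit = hasattr(pipelines, 'fit')
print(list_has_fit)

# A pipeline only knows the parameters of its own steps.
rf_accepts_svm_c = 'svm__C' in rf_pipeline.get_params()
svm_accepts_svm_c = 'svm__C' in svm_pipeline.get_params()
print(rf_accepts_svm_c, svm_accepts_svm_c)

# Setting a parameter of a step the pipeline does not have raises ValueError.
try:
    rf_pipeline.set_params(svm__C=1)
    invalid_param_error = None
except ValueError as error:
    invalid_param_error = error
print(invalid_param_error)
```

The first check explains the TypeError from the question, and the last one shows the kind of ValueError you would hit if the grids were mixed.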


    Solution: Fit one GridSearchCV object for the random forest model and another GridSearchCV object for the LinearSVC model.

    pipelines=[random_forest_pipeline,svm_pipeline]
    
    grid_search_1=GridSearchCV(pipelines[0],param_grid=parameters[0],cv=3,n_jobs=-1)
    grid_search_1.fit(X,y)
    
    grid_search_2=GridSearchCV(pipelines[1],param_grid=parameters[1],cv=3,n_jobs=-1)
    grid_search_2.fit(X,y)
    

    Full code:

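    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    random_forest_pipeline=Pipeline([
        ('vectorizer',CountVectorizer(stop_words='english')),
        ('random_forest',RandomForestClassifier())
    ])
    svm_pipeline=Pipeline([
        ('vectorizer',CountVectorizer(stop_words='english')),
        ('svm',LinearSVC())
    ])

    # one parameter grid per pipeline, in the same order as the pipelines list
    parameters=[
        {
            'vectorizer__max_features':[500,1000,1500],
            'random_forest__min_samples_split':[50,100,250,500]
        },
        {
            'vectorizer__max_features':[500,1000,1500],
            'svm__C':[1,3,5]
        }
    ]

    pipelines=[random_forest_pipeline,svm_pipeline]

    # X, y are the training texts and labels (x_train, y_train in the question)

    # grid search only for the random forest model
    grid_search_1=GridSearchCV(pipelines[0],param_grid=parameters[0],cv=3,n_jobs=-1)
    grid_search_1.fit(X,y)

    # grid search only for the LinearSVC model
    grid_search_2=GridSearchCV(pipelines[1],param_grid=parameters[1],cv=3,n_jobs=-1)
    grid_search_2.fit(X,y)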
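Once both searches are fitted, the models can be compared through their best_score_ attribute (the mean cross-validated score of the best parameter combination). Here is a rough sketch of that comparison. The tiny corpus, the small parameter values, and the names candidates, searches and best_name are assumptions made only so the example runs on its own; substitute your own data and grids.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus: 6 sports texts (label 0) and 6 cooking texts (label 1).
texts = [
    "the team won the football match",
    "great goal scored in the final game",
    "the striker scored a goal for the team",
    "football fans cheered the winning team",
    "the coach praised the team after the match",
    "the game ended with a late goal",
    "bake the cake in a hot oven",
    "add sugar and flour to the bowl",
    "stir the soup and add salt",
    "the recipe needs butter and flour",
    "bake the bread with flour and salt",
    "mix sugar and butter for the cake",
]
labels = [0] * 6 + [1] * 6

# One (pipeline, parameter grid) pair per model; grid values are toy-sized.
candidates = {
    'random_forest': (
        Pipeline([
            ('vectorizer', CountVectorizer(stop_words='english')),
            ('random_forest', RandomForestClassifier(n_estimators=10, random_state=0)),
        ]),
        {'vectorizer__max_features': [10, 20],
         'random_forest__min_samples_split': [2, 4]},
    ),
    'svm': (
        Pipeline([
            ('vectorizer', CountVectorizer(stop_words='english')),
            ('svm', LinearSVC()),
        ]),
        {'vectorizer__max_features': [10, 20],
         'svm__C': [1, 3, 5]},
    ),
}

# Fit one GridSearchCV per model, each with its own grid.
searches = {}
for name, (pipeline, grid) in candidates.items():
    search = GridSearchCV(pipeline, param_grid=grid, cv=3)
    search.fit(texts, labels)
    searches[name] = search
    print(name, search.best_score_, search.best_params_)

# Pick the model whose best parameter combination scored highest.
best_name = max(searches, key=lambda n: searches[n].best_score_)
best_model = searches[best_name].best_estimator_
print('best model:', best_name)
```

Because both searches use the same data and the same cv setting, their best_score_ values are directly comparable.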
    

    EDIT

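    According to the documentation, a single GridSearchCV can search over several models if you pass the estimators themselves as values in the param_grid list, so that each dictionary in the list sets a pipeline step and the parameters that belong to it.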

    Link: https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html?highlight=pipeline%20gridsearch

    Code from doc:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC
    from sklearn.decomposition import PCA, NMF
    from sklearn.feature_selection import SelectKBest, chi2
    
    print(__doc__)
    
    pipe = Pipeline([
        # the reduce_dim stage is populated by the param_grid
        ('reduce_dim', 'passthrough'),
        ('classify', LinearSVC(dual=False, max_iter=10000))
    ])
    
    N_FEATURES_OPTIONS = [2, 4, 8]
    C_OPTIONS = [1, 10, 100, 1000]
    param_grid = [
        {
            'reduce_dim': [PCA(iterated_power=7), NMF()],
            'reduce_dim__n_components': N_FEATURES_OPTIONS,
            'classify__C': C_OPTIONS
        },
        {
            'reduce_dim': [SelectKBest(chi2)],
            'reduce_dim__k': N_FEATURES_OPTIONS,
            'classify__C': C_OPTIONS
        },
    ]
    reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']
    
    grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
    X, y = load_digits(return_X_y=True)
    grid.fit(X, y)
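Applied to the question, the same idea could look like the sketch below. It is an adaptation, not code from the documentation: the step name 'classifier', the toy corpus and the small parameter values are assumptions chosen so the example runs on its own. Note that the final step is given a real estimator as a placeholder rather than 'passthrough', since GridSearchCV with default scoring needs the pipeline to expose a score method.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus: 6 sports texts (label 0) and 6 cooking texts (label 1).
texts = [
    "the team won the football match",
    "great goal scored in the final game",
    "the striker scored a goal for the team",
    "football fans cheered the winning team",
    "the coach praised the team after the match",
    "the game ended with a late goal",
    "bake the cake in a hot oven",
    "add sugar and flour to the bowl",
    "stir the soup and add salt",
    "the recipe needs butter and flour",
    "bake the bread with flour and salt",
    "mix sugar and butter for the cake",
]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    # placeholder estimator; each grid below replaces this step
    ('classifier', LinearSVC()),
])

# One dictionary per model: it sets the 'classifier' step and only the
# parameters that are valid for that classifier.
param_grid = [
    {
        'classifier': [RandomForestClassifier(n_estimators=10, random_state=0)],
        'classifier__min_samples_split': [2, 4],
        'vectorizer__max_features': [10, 20],
    },
    {
        'classifier': [LinearSVC()],
        'classifier__C': [1, 3, 5],
        'vectorizer__max_features': [10, 20],
    },
]

grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(texts, labels)

# best_params_ includes the winning classifier under the 'classifier' key.
print(grid_search.best_params_)
print(grid_search.best_score_)
```

With this layout a single fit compares both models, and grid_search.best_estimator_ is the whole pipeline refitted with the winning classifier and parameters.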