python scikit-learn pipeline cross-validation grid-search

Alternate different models in Pipeline for GridSearchCV


I want to build a Pipeline in sklearn and test different models using GridSearchCV.

Just an example (please ignore which particular models are chosen):

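from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import MDS, TSNE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

reg = LogisticRegression()

proj1 = PCA(n_components=2)
proj2 = MDS()
proj3 = TSNE()

pipe = Pipeline([('proj', proj1), ('reg', reg)])

param_grid = {
    'reg__C': [0.01, 0.1, 1],  # LogisticRegression's parameter is a capital C
}

clf = GridSearchCV(pipe, param_grid=param_grid)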

Here, if I want to try different models for dimensionality reduction, I need to code different pipelines and compare them manually. Is there an easier way to do it?

One solution I came up with is to define my own class derived from BaseEstimator:

class Projection(BaseEstimator):
    def __init__(self, est_name):
        if est_name == "MDS":
            self.model = MDS()
        ...
    ...
    def fit_transform(self, X):
        return self.model.fit_transform(X)

I think it will work: I just create a Projection object and pass it to the Pipeline, using the names of the estimators as its parameters.

But to me this approach is a bit chaotic and not scalable: it forces me to define a new class each time I want to compare different models. To extend this solution, one could implement a class that does the same job with an arbitrary set of models, but that seems overcomplicated to me.

What is the most natural and pythonic way to compare different models?


Solution

  • Let's assume you want to use PCA and TruncatedSVD as your dimensionality reduction step.

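    from sklearn import decomposition
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    pca = decomposition.PCA()
    svd = decomposition.TruncatedSVD()
    svm = SVC()
    n_components = [20, 40, 64]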
    

    You can do this:

    pipe = Pipeline(steps=[('reduction', pca), ('svm', svm)])
    
    # Make params_grid a list of dicts instead of a single dict:
    # the first dict holds the PCA settings, the second the TruncatedSVD settings.
    # Every value must be a list, including the estimator assigned to 'reduction'.

    params_grid = [{
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'reduction': [pca],
        'reduction__n_components': n_components,
    },
    {
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'reduction': [svd],
        'reduction__n_components': n_components,
        'reduction__algorithm': ['randomized'],
    }]
    

    Now just pass the pipeline object to GridSearchCV:

    grd = GridSearchCV(pipe, param_grid = params_grid)
    

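    Calling grd.fit() will search over both dicts in the params_grid list, one dict at a time: every combination of values from the first dict is tried, then every combination from the second. Values from different dicts are never mixed.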
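
To check how a list of dicts expands without fitting anything, the grid can be handed to sklearn's ParameterGrid, which enumerates candidates the same way GridSearchCV does. This is only a sketch; the variable names candidates, n_pca and n_svd are made up for illustration.

```python
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import ParameterGrid

pca = PCA()
svd = TruncatedSVD()
n_components = [20, 40, 64]

# A list of two dicts: one per reduction method. Every value is a list,
# including the 'reduction' entry that replaces the pipeline step itself.
params_grid = [
    {
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'reduction': [pca],
        'reduction__n_components': n_components,
    },
    {
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'reduction': [svd],
        'reduction__n_components': n_components,
        'reduction__algorithm': ['randomized'],
    },
]

# Each dict is expanded on its own; the two expansions are concatenated.
candidates = list(ParameterGrid(params_grid))
n_pca = sum(isinstance(c['reduction'], PCA) for c in candidates)
n_svd = sum(isinstance(c['reduction'], TruncatedSVD) for c in candidates)

print(len(candidates), n_pca, n_svd)  # 96 48 48
```

Each dict contributes 4 * 2 * 2 * 3 = 48 candidates, and no candidate combines PCA with a TruncatedSVD-only parameter such as reduction__algorithm.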

    Please look at my other answer for more details: "Parallel" pipeline to get best model using gridsearch
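
For completeness, here is a small end-to-end sketch of the same idea. The iris dataset, the reduced grid, and cv=3 are assumptions chosen only so the example runs quickly; they are not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumed toy data: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# The initial 'reduction' step is a placeholder; the grid swaps it out.
pipe = Pipeline(steps=[('reduction', PCA()), ('svm', SVC())])

# One dict per reduction method, each value given as a list.
params_grid = [
    {
        'reduction': [PCA()],
        'reduction__n_components': [2, 3],
        'svm__C': [1, 10],
    },
    {
        'reduction': [TruncatedSVD()],
        'reduction__n_components': [2, 3],
        'svm__C': [1, 10],
    },
]

grd = GridSearchCV(pipe, param_grid=params_grid, cv=3)
grd.fit(X, y)

# best_params_ holds the winning estimator object under the step name.
best_reducer = type(grd.best_params_['reduction']).__name__
print(best_reducer, grd.best_params_['reduction__n_components'], grd.best_score_)
```

After fitting, grd.cv_results_['params'] lists all 8 candidates, so the scores for the PCA and TruncatedSVD variants can be compared side by side rather than only reading off the winner.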