python scikit-learn pipeline cross-validation grid-search

Alternate different models in Pipeline for GridSearchCV

I want to build a Pipeline in sklearn and test different models using GridSearchCV.

Just an example (please do not pay attention to which particular models are chosen):

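from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import MDS, TSNE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

reg = LogisticRegression()

proj1 = PCA(n_components=2)
proj2 = MDS()
proj3 = TSNE()

pipe = Pipeline([('proj', proj1), ('reg', reg)])

param_grid = {
    'reg__C': [0.01, 0.1, 1],  # the LogisticRegression parameter is a capital C
}

clf = GridSearchCV(pipe, param_grid=param_grid)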

Here, if I want to try different models for dimensionality reduction, I need to code separate pipelines and compare them manually. Is there an easier way to do it?

One solution I came up with is to define my own class derived from BaseEstimator:

class Projection(BaseEstimator):
    def __init__(self, est_name):
        if est_name == "MDS":
            self.model = MDS()
        ...
    ...
    def fit_transform(self, X):
        return self.model.fit_transform(X)

I think it will work: I just create a Projection object and pass it to the Pipeline, using the names of the estimators as its parameters.

But to me this approach is a bit chaotic and not scalable: it makes me define a new class each time I want to compare different models. To take this solution further, one could implement a class that does the same job with an arbitrary set of models, but that seems overcomplicated to me.

What is the most natural and pythonic way to compare different models?

Solution

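Let's assume you want to use PCA and TruncatedSVD as your dimensionality reduction step.

from sklearn import decomposition
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pca = decomposition.PCA()
svd = decomposition.TruncatedSVD()
svm = SVC()
n_components = [20, 40, 64]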

You can do this:

pipe = Pipeline(steps=[('reduction', pca), ('svm', svm)])

# Make params_grid a list of dicts instead of a single dict.
# The first dict holds the parameters for pca, the second those for svd.
# Every value must be a list, so the estimators are wrapped as [pca] and [svd].

params_grid = [{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'reduction': [pca],
    'reduction__n_components': n_components,
},
{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'reduction': [svd],
    'reduction__n_components': n_components,
    'reduction__algorithm': ['randomized'],
}]

and now just pass the pipeline object to GridSearchCV:

grd = GridSearchCV(pipe, param_grid = params_grid)

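Calling grd.fit() searches over both elements of the params_grid list, one dict at a time: every combination of values in the first dict is tried, then every combination in the second. Values are never mixed across the two dicts, so 'reduction__algorithm' is only ever set on the TruncatedSVD step.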
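To see which candidates such a list of dicts produces without fitting anything, it can be expanded with ParameterGrid from sklearn.model_selection, which enumerates the same kind of grid that GridSearchCV accepts. This is only a sketch: the svm_params helper dict and the counting variables are names introduced here for illustration, and the estimators are wrapped in lists as a parameter grid requires.

```python
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import ParameterGrid

pca = PCA()
svd = TruncatedSVD()
n_components = [20, 40, 64]

# Hypothetical helper: the SVM settings shared by both dicts.
svm_params = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
}

params_grid = [
    {**svm_params,
     'reduction': [pca],
     'reduction__n_components': n_components},
    {**svm_params,
     'reduction': [svd],
     'reduction__n_components': n_components,
     'reduction__algorithm': ['randomized']},
]

# Each dict is expanded on its own; the results are concatenated.
candidates = list(ParameterGrid(params_grid))
n_pca = sum(isinstance(c['reduction'], PCA) for c in candidates)
n_svd = sum(isinstance(c['reduction'], TruncatedSVD) for c in candidates)

print(len(candidates), n_pca, n_svd)  # 96 48 48
```

Each dict contributes 4 * 2 * 2 * 3 = 48 combinations, and no PCA candidate carries the 'reduction__algorithm' key.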

Please look at my other answer for more details: "Parallel" pipeline to get best model using gridsearch
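For completeness, here is a minimal end-to-end sketch of the approach above. The dataset and search settings are my assumptions, not part of the original setup: it uses the first 600 samples of the digits dataset (64 features), a smaller grid, n_components values below 64 (older scikit-learn versions reject TruncatedSVD with n_components equal to the number of features), and cv=3 to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumption: a small slice of the digits data, chosen only for speed.
X, y = load_digits(return_X_y=True)
X, y = X[:600], y[:600]

# The estimator placed in the 'reduction' step here is just a placeholder;
# the grid below swaps it out.
pipe = Pipeline(steps=[('reduction', PCA()), ('svm', SVC())])

n_components = [10, 20, 30]
svm_params = {  # hypothetical helper dict, reduced from the grid above
    'svm__C': [1, 10],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001],
}

params_grid = [
    {**svm_params,
     'reduction': [PCA()],
     'reduction__n_components': n_components},
    {**svm_params,
     'reduction': [TruncatedSVD(random_state=0)],
     'reduction__n_components': n_components},
]

grd = GridSearchCV(pipe, param_grid=params_grid, cv=3)
grd.fit(X, y)

# The winning reducer is reported alongside the other best parameters.
best_reducer = type(grd.best_params_['reduction']).__name__
print(best_reducer, grd.best_params_['reduction__n_components'])
print(grd.best_score_)
```

grd.cv_results_['params'] lists every candidate that was tried, so the PCA and TruncatedSVD runs can be compared side by side from a single search.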