Sklearn Pipeline: how to pass a parameter to a custom Transformer?

I have a custom Transformer in my sklearn Pipeline, and I would like to know how to pass a parameter to it.

In the code below, you can see that I use a dictionary "weight" in my Transformer. I would rather not define this dictionary inside the Transformer, but pass it in from the Pipeline instead, so that I can include it in a grid search. Is it possible to pass the dictionary as a parameter to my Transformer?

# My custom Transformer
class TextExtractor(BaseEstimator, TransformerMixin):
    """Concatenate the 'title', 'body' and 'code' from the results of a
    Stack Overflow query.
    Keys are 'title', 'body' and 'code'.
    """
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # here is the parameter I want to pass to my transformer
        weight = {'title': 10, 'body': 1, 'code': 1}
        x['text'] = (weight['title'] * x['Title'] +
                     weight['body'] * x['Body'] +
                     weight['code'] * x['Code'])

        return x['text']

param_grid = {
    'min_df': [10],
    'max_df': [0.01],
    'max_features': [200],
    'clf': [sgd],
    # here is the parameter I want to pass to my transformer
    'weight': [{'title': 10, 'body': 1, 'code': 1},
               {'title': 1, 'body': 1, 'code': 1}],
}

for g in ParameterGrid(param_grid):
    classifier_pipe = Pipeline(
        steps=[
            # is it possible to pass my parameter here?
            ('textextractor', TextExtractor()),
            ('vectorizer', TfidfVectorizer(max_df=g['max_df'],
                                           min_df=g['min_df'],
                                           max_features=g['max_features'])),
            ('clf', g['clf']),
        ],
    )

Solution

For this, you just need to add an __init__() method to your class definition. It declares weight as a constructor argument of TextExtractor and stores it on the instance, so that transform() can read it as self.weight. Because the class inherits from BaseEstimator, an argument stored this way is also visible to get_params() and set_params(), which is what a grid search relies on.

Here is how it can be done. (I added some setup code first for the sake of reproducibility: since you did not provide any data, I made up some fake data. I also assumed that the weights are meant to multiply, i.e. repeat, the strings.)

# import all the necessary packages
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ParameterGrid, GridSearchCV
from sklearn.linear_model import SGDClassifier

import pandas as pd
import numpy as np

#Sample data
X = pd.DataFrame({"Title" : ["T1","T2","T3","T4","T5"], "Body": ["B1","B2","B3","B4","B5"], "Code": ["C1","C2","C3","C4","C5"]})
y = np.array([0,0,1,1,1])

#Define the SGDClassifier
sgd = SGDClassifier()

Below, I only added the __init__ method and made transform() read the weights from self.weight:

# My custom Transformer

class TextExtractor(BaseEstimator, TransformerMixin):
    """Concatenate the 'title', 'body' and 'code' from the results of a
    Stack Overflow query.
    Keys are 'title', 'body' and 'code'.
    """

    def __init__(self, weight={'title': 10, 'body': 1, 'code': 1}):
        # Store the argument unchanged, under the same name as the parameter
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # Repeat each column according to its weight, then concatenate.
        # The result is returned directly so the input DataFrame is not modified.
        return (self.weight['title'] * x['Title'] +
                self.weight['body'] * x['Body'] +
                self.weight['code'] * x['Code'])

Note that I gave weight a default value, which is used when you don't specify one; whether to have a default is up to you. You can then call your transformer like this:

textextractor = TextExtractor(weight = {'title' : 5, 'body': 2, 'code' : 1})
textextractor.transform(X)

This should return:

0    T1T1T1T1T1B1B1C1
1    T2T2T2T2T2B2B2C2
2    T3T3T3T3T3B3B3C3
3    T4T4T4T4T4B4B4C4
4    T5T5T5T5T5B5B5C5
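
As a side note on why the __init__ argument matters for the grid search: BaseEstimator discovers parameters from the __init__ signature, so weight shows up in get_params() and can be replaced through set_params(). The snippet below is a minimal sketch of that, using a cut-down copy of the class (transform is left out of the picture and the weight values are only examples):

```python
from sklearn.base import BaseEstimator, TransformerMixin


class TextExtractor(BaseEstimator, TransformerMixin):
    """Cut-down copy of the transformer, enough to inspect its parameters."""

    def __init__(self, weight={'title': 10, 'body': 1, 'code': 1}):
        # Attribute name matches the argument name, as BaseEstimator expects
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return (self.weight['title'] * x['Title'] +
                self.weight['body'] * x['Body'] +
                self.weight['code'] * x['Code'])


extractor = TextExtractor(weight={'title': 5, 'body': 2, 'code': 1})

# get_params() reports the constructor arguments by name
params = extractor.get_params()
print(params)

# set_params() replaces a parameter on the existing object;
# Pipeline and GridSearchCV use this mechanism to try new values
extractor.set_params(weight={'title': 1, 'body': 1, 'code': 1})
print(extractor.weight)
```

If weight were hard-coded inside transform() instead, get_params() would return an empty dictionary and a grid search would have nothing to vary.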

Then you can define your parameter grid:

param_grid = {
    'vectorizer__min_df': [0.1],
    'vectorizer__max_df': [0.9],
    'vectorizer__max_features': [200],
    # here is the parameter I want to pass to my transformer
    'textextractor__weight': [{'title': 10, 'body': 1, 'code': 1},
                              {'title': 1, 'body': 1, 'code': 1}],
}
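
The keys in this grid follow the pattern step name, two underscores, parameter name, where the step name is the label given to the step in the Pipeline. A quick way to check that the keys are spelled correctly is to compare them with pipe.get_params(), as in this rough sketch (the cut-down TextExtractor and the variable names unknown and pipe are only for illustration):

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


class TextExtractor(BaseEstimator, TransformerMixin):
    """Cut-down copy of the transformer, enough to show parameter routing."""

    def __init__(self, weight={'title': 10, 'body': 1, 'code': 1}):
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x


pipe = Pipeline(steps=[
    ('textextractor', TextExtractor()),
    ('vectorizer', TfidfVectorizer()),
    ('clf', SGDClassifier()),
])

param_grid = {
    'vectorizer__min_df': [0.1],
    'vectorizer__max_df': [0.9],
    'vectorizer__max_features': [200],
    'textextractor__weight': [{'title': 10, 'body': 1, 'code': 1},
                              {'title': 1, 'body': 1, 'code': 1}],
}

# Every grid key should be a parameter name the pipeline knows about
valid_names = pipe.get_params().keys()
unknown = [name for name in param_grid if name not in valid_names]
print(unknown)

# The same names work with set_params(), which forwards each value
# to the step named before the double underscore
pipe.set_params(textextractor__weight={'title': 1, 'body': 1, 'code': 1},
                vectorizer__max_features=200)
print(pipe.named_steps['textextractor'].weight)
```

A misspelled key would appear in unknown here, and would make set_params() or a grid search raise an error about an invalid parameter.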

And finally do:

for g in ParameterGrid(param_grid):
    classifier_pipe = Pipeline(
        steps=[
            ('textextractor', TextExtractor(weight=g['textextractor__weight'])),
            ('vectorizer', TfidfVectorizer(max_df=g['vectorizer__max_df'],
                                           min_df=g['vectorizer__min_df'],
                                           max_features=g['vectorizer__max_features'])),
            ('clf', sgd),
        ],
    )

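Instead of looping manually, you might want to run a grid search with GridSearchCV, which sets each parameter on the right step for you based on the keys of param_grid. That would require you to write: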

pipe = Pipeline(steps=[('textextractor', TextExtractor()),
                       ('vectorizer', TfidfVectorizer()),
                       ('clf', sgd)])
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)
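
Once the search has been fitted, the selected weights can be read back from best_params_. Here is a self-contained sketch of the whole flow; the data is made up purely for illustration (six rows so that cv=3 gets one sample of each class per fold, with trailing spaces so that repeated words stay separate tokens), and only the weight is searched over to keep it short:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class TextExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, weight={'title': 10, 'body': 1, 'code': 1}):
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # Repeat each text column according to its weight, then concatenate
        return (self.weight['title'] * x['Title'] +
                self.weight['body'] * x['Body'] +
                self.weight['code'] * x['Code'])


# Made-up sample data, not from the question
X = pd.DataFrame({
    "Title": ["sort ", "sort ", "sort ", "plot ", "plot ", "plot "],
    "Body": ["list ", "dict ", "list ", "axis ", "line ", "axis "],
    "Code": ["sorted ", "sorted ", "min ", "show ", "show ", "figure "],
})
y = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline(steps=[
    ('textextractor', TextExtractor()),
    ('vectorizer', TfidfVectorizer()),
    ('clf', SGDClassifier(random_state=0)),
])

param_grid = {
    'textextractor__weight': [{'title': 10, 'body': 1, 'code': 1},
                              {'title': 1, 'body': 1, 'code': 1}],
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)

# The weight dictionary that scored best in cross-validation
best_weight = grid.best_params_['textextractor__weight']
print(best_weight)

# All combinations that were tried, in order
tried = grid.cv_results_['params']
print(tried)
```

With the default refit=True, grid.best_estimator_ is the pipeline refitted on all the data with the winning weights, so it can be used directly for prediction.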