Sklearn Pipeline : pass a parameter to a custom Transformer?


I have a custom Transformer in my sklearn Pipeline and I wonder how to pass a parameter to it.

In the code below, my Transformer uses a dictionary "weight". Rather than defining this dictionary inside the Transformer, I would like to pass it in from the Pipeline so that I can include it in a grid search. Is it possible to pass the dictionary as a parameter to my Transformer?

# My custom Transformer
class TextExtractor(BaseEstimator, TransformerMixin):
    """Concat the 'title', 'body' and 'code' from the results of
    Stackoverflow query
    Keys are 'title', 'body' and 'code'.
    """
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # here is the parameter I want to pass to my transformer
        weight = {'title' : 10, 'body': 1, 'code' : 1}
        # parentheses let the expression span several lines
        x['text'] = (weight['title'] * x['Title'] +
                     weight['body'] * x['Body'] +
                     weight['code'] * x['Code'])

        return x['text']

param_grid = {
    'min_df' : [10],
    'max_df' : [0.01],
    'max_features': [200],
    'clf' : [sgd],
    # here is the parameter I want to pass to my transformer
    'weight' : [{'title' : 10, 'body': 1, 'code' : 1},
                {'title' : 1, 'body': 1, 'code' : 1}]
}

for g in ParameterGrid(param_grid):
    classifier_pipe = Pipeline(
        steps=[
            ('textextractor', TextExtractor()),  # is it possible to pass my parameter?
            ('vectorizer', TfidfVectorizer(max_df=g['max_df'],
                                           min_df=g['min_df'],
                                           max_features=g['max_features'])),
            ('clf', g['clf']),
        ],
    )

Solution

  • For this, you just need to add an __init__() method to your class definition. There, you declare that TextExtractor takes an argument called weight and store it as an attribute of the same name. scikit-learn's BaseEstimator reads the __init__ signature to discover an estimator's parameters, which is what makes weight settable from a Pipeline or a grid search.

    Here is how it can be done. I added some setup code first for the sake of reproducibility; since you did not provide any data, I made up some fake data. I also assumed that what you want the weights to do is repeat (multiply) the strings.

    # import all the necessary packages
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import ParameterGrid, GridSearchCV
    from sklearn.linear_model import SGDClassifier
    
    import pandas as pd
    import numpy as np
    
    #Sample data
    X = pd.DataFrame({"Title" : ["T1","T2","T3","T4","T5"], "Body": ["B1","B2","B3","B4","B5"], "Code": ["C1","C2","C3","C4","C5"]})
    y = np.array([0,0,1,1,1])
    
    #Define the SGDClassifier
    sgd = SGDClassifier()
    

    Below, the only changes are the added __init__ method and the use of self.weight in transform:

    # My custom Transformer
    
    class TextExtractor(BaseEstimator, TransformerMixin):
        """Concat the 'title', 'body' and 'code' from the results of
        Stackoverflow query
        Keys are 'title', 'body' and 'code'.
        """

        def __init__(self, weight = {'title' : 10, 'body': 1, 'code' : 1}):
            # store the argument under the same name so get_params/set_params can find it
            self.weight = weight

        def fit(self, x, y=None):
            return self

        def transform(self, x):
            # integer * string column repeats each string that many times
            x['text'] = (self.weight['title'] * x['Title'] +
                         self.weight['body'] * x['Body'] +
                         self.weight['code'] * x['Code'])

            return x['text']
    

    Note that I gave weight a default value, which is used when you don't specify one; whether to provide a default is up to you. You can then call your transformer like this:

    textextractor = TextExtractor(weight = {'title' : 5, 'body': 2, 'code' : 1})
    textextractor.transform(X)
    

    This should return:

    0    T1T1T1T1T1B1B1C1
    1    T2T2T2T2T2B2B2C2
    2    T3T3T3T3T3B3B3C3
    3    T4T4T4T4T4B4B4C4
    4    T5T5T5T5T5B5B5C5
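
    One caveat about the default value: a dictionary used as a default argument is created once and shared by every instance that relies on it. A common workaround, sketched here under my own assumptions, is to default to None and resolve the fallback inside transform, so that __init__ only stores the argument unchanged. DEFAULT_WEIGHT is a hypothetical name, and this variant returns the combined column without adding 'text' to the input frame:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical module-level fallback, not part of the original code
DEFAULT_WEIGHT = {'title': 10, 'body': 1, 'code': 1}


class TextExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, weight=None):
        # store the argument untouched; no copying or validation here
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # resolve the fallback at call time instead of in the signature
        w = DEFAULT_WEIGHT if self.weight is None else self.weight
        return (w['title'] * x['Title'] +
                w['body'] * x['Body'] +
                w['code'] * x['Code'])


X = pd.DataFrame({"Title": ["T1", "T2"], "Body": ["B1", "B2"], "Code": ["C1", "C2"]})
default_out = TextExtractor().transform(X)
custom_out = TextExtractor(weight={'title': 2, 'body': 1, 'code': 1}).transform(X)
print(custom_out.tolist())
```

    Whether this matters in practice depends on your code: the shared default only causes trouble if something mutates the dictionary in place.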
    

    Then you can define your parameter grid. Each key is the name of a pipeline step, followed by two underscores and the name of the parameter to set on that step:

    param_grid = {
    'vectorizer__min_df' : [0.1],
    'vectorizer__max_df' : [0.9],
    'vectorizer__max_features': [200],
    # here is the parameter I want to pass to my transformer
    'textextractor__weight' : [{'title' : 10, 'body': 1, 'code' : 1},
                               {'title' : 1, 'body': 1, 'code' : 1}]
    }
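
    To see why a key such as textextractor__weight works, here is a small sketch of my own (not part of the original code). It uses a condensed copy of the transformer so that it runs on its own, and relies on Pipeline.get_params() and Pipeline.set_params(), which address each step's parameters by the step name plus a double underscore:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class TextExtractor(BaseEstimator, TransformerMixin):
    """Condensed copy of the transformer, repeated so the snippet is self-contained."""

    def __init__(self, weight={'title': 10, 'body': 1, 'code': 1}):
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return (self.weight['title'] * x['Title'] +
                self.weight['body'] * x['Body'] +
                self.weight['code'] * x['Code'])


pipe = Pipeline(steps=[('textextractor', TextExtractor()),
                       ('vectorizer', TfidfVectorizer())])

# Parameters of each step show up as '<step name>__<parameter name>'
weight_keys = [k for k in pipe.get_params() if k.endswith('__weight')]
print(weight_keys)

# Setting the prefixed key updates the attribute on the step itself
pipe.set_params(textextractor__weight={'title': 1, 'body': 1, 'code': 1})
print(pipe.named_steps['textextractor'].weight)
```

    A grid search does essentially the same thing for every candidate: it clones the pipeline and calls set_params with one combination from the grid.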
    

    And finally do:

    for g in ParameterGrid(param_grid):
        classifier_pipe = Pipeline(
            steps=[
                ('textextractor', TextExtractor(weight=g['textextractor__weight'])),
                ('vectorizer', TfidfVectorizer(max_df=g['vectorizer__max_df'],
                                               min_df=g['vectorizer__min_df'],
                                               max_features=g['vectorizer__max_features'])),
                ('clf', sgd),
            ]
        )
    

    Instead of this manual loop, you might want to run a grid search, in which case GridSearchCV sets each parameter on the pipeline for you:

    pipe = Pipeline(
        steps=[
            ('textextractor', TextExtractor()),
            ('vectorizer', TfidfVectorizer()),
            ('clf', sgd),
        ]
    )
    grid = GridSearchCV(pipe, param_grid, cv = 3)
    grid.fit(X,y)
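
    Once the search is fitted, the selected dictionary can be read back from the standard best_params_ attribute. The sketch below repeats the setup from above so that it runs on its own; the condensed transformer and the random_state=0 setting are my additions:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class TextExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, weight={'title': 10, 'body': 1, 'code': 1}):
        self.weight = weight

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return (self.weight['title'] * x['Title'] +
                self.weight['body'] * x['Body'] +
                self.weight['code'] * x['Code'])


# Same fake data as above
X = pd.DataFrame({"Title": ["T1", "T2", "T3", "T4", "T5"],
                  "Body": ["B1", "B2", "B3", "B4", "B5"],
                  "Code": ["C1", "C2", "C3", "C4", "C5"]})
y = np.array([0, 0, 1, 1, 1])

param_grid = {
    'vectorizer__min_df': [0.1],
    'vectorizer__max_df': [0.9],
    'vectorizer__max_features': [200],
    'textextractor__weight': [{'title': 10, 'body': 1, 'code': 1},
                              {'title': 1, 'body': 1, 'code': 1}],
}

pipe = Pipeline(steps=[('textextractor', TextExtractor()),
                       ('vectorizer', TfidfVectorizer()),
                       ('clf', SGDClassifier(random_state=0))])  # seed is an assumption
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)

# The weight dictionary that scored best, and every combination that was tried
print(grid.best_params_['textextractor__weight'])
print(grid.cv_results_['params'])
```

    With only five made-up rows the scores themselves are meaningless, and scikit-learn may warn that a class has fewer members than the number of folds; the point is only that the dictionary is searched over like any other parameter.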