Tags: python, pipeline, random-forest, tfidfvectorizer, gridsearchcv

Pass object attribute from previous sklearn pipeline step as argument to next step method


tl;dr: Is there any way to call .get_feature_names() on the fitted transformer from the previous step of the pipeline and use the result as a hyperparameter in the next step of the pipeline?


I have a Pipeline that fits and transforms text data with TfidfVectorizer and then runs a RandomForestClassifier. I want to use GridSearchCV to search over a range of max_features values in the classifier, where the range depends on the number of features the vectorizer produced from the text.

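import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# set up the pipeline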
pipe = Pipeline([
    ('vect', TfidfVectorizer(max_df=.4,
                            min_df=3,
                            norm='l1',
                            stop_words='english',
                            use_idf=False)),
    ('rf', RandomForestClassifier(random_state=1,
                                  criterion='entropy',
                                  n_estimators=800))
])

# set up the parameter grid
params = {
    'rf__max_features': np.arange(1, len(vect.get_feature_names()),1)
}

Defining the parameter grid raises the following error:

NameError: name 'vect' is not defined

Edit:

This matters more (although the sample code doesn't show it) when the grid also varies a TfidfVectorizer parameter such as ngram_range, since that changes the number of features passed to the next step.


Solution

  • The parameter grid is built before anything in the pipeline is fitted, so you can't do this directly. You might be able to monkey-patch the grid search, like here, but I'd expect that to be substantially harder, since your second parameter depends on the result of fitting the first step.

    I think the best approach, while it won't produce exactly what you're after, is to just use fractional values for max_features, i.e. a fraction of the columns coming out of the vectorizer.
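
As a rough sketch of the fractional approach, the grid below reuses the pipeline from the question and searches over float values of max_features. The particular fractions, the ngram_range candidates, cv=3, and the names docs and labels are illustrative assumptions, not part of the original question.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('vect', TfidfVectorizer(max_df=.4, min_df=3, norm='l1',
                             stop_words='english', use_idf=False)),
    ('rf', RandomForestClassifier(random_state=1, criterion='entropy',
                                  n_estimators=800)),
])

# A float max_features is read as a fraction of the columns the forest
# receives, so the grid no longer needs to know the vocabulary size.
# The specific values here are arbitrary examples.
params = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'rf__max_features': [0.1, 0.25, 0.5, 0.75, 1.0],
}

search = GridSearchCV(pipe, params, cv=3)
# search.fit(docs, labels)  # docs and labels stand in for your own data
```

For a float value, scikit-learn considers max(1, int(max_features * n_features_in_)) features at each split, so the same grid stays valid however many columns a given ngram_range produces.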

    If you really want a score for every integer max_features, I think the easiest way may be to have two nested grid searches, the inner one only instantiating the parameter space when its fit is called:

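    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    estimator = RandomForestClassifier(
        random_state=1,
        criterion='entropy',
        n_estimators=800
    )

    class MySearcher(GridSearchCV):
        # Build the real grid at fit time, once the number of columns
        # coming out of the vectorizer is known.
        def fit(self, X, y=None, **fit_params):
            m = X.shape[1]
            # np.arange excludes the stop value, so this covers 1..m-1
            self.param_grid = {'max_features': np.arange(1, m, 1)}
            return super().fit(X, y, **fit_params)

    pipe = Pipeline([
        ('vect', TfidfVectorizer(max_df=.4,
                                 min_df=3,
                                 norm='l1',
                                 stop_words='english',
                                 use_idf=False)),
        # The placeholder grid only has to get past construction;
        # it is replaced as soon as fit() is called.
        ('rf', MySearcher(estimator=estimator,
                          param_grid={'fake': ['passes', 'check']}))
    ])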

    Now the search results will be awkwardly nested: the best value of, say, ngram_range gives you a refitted copy of pipe, whose second step will itself have a best value of max_features and a corresponding refitted random forest. Also, the data available for the inner search will be a bit smaller, since it only sees the training folds of the outer search.
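
To make the nesting concrete, here is a small end-to-end sketch of how the results could be read back. The toy corpus, the labels, the reduced n_estimators, the default vectorizer settings, and cv=2 are all assumptions chosen so the example runs quickly; they are not from the original question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class MySearcher(GridSearchCV):
    # Same idea as above: build the grid once the column count is known.
    def fit(self, X, y=None, **fit_params):
        m = X.shape[1]
        self.param_grid = {'max_features': np.arange(1, m, 1)}
        return super().fit(X, y, **fit_params)


# Hypothetical toy data, only here to make the sketch runnable.
docs = [
    "good movie", "great film", "good film", "great movie",
    "good great movie", "great good film",
    "bad movie", "awful film", "bad film", "awful movie",
    "bad awful movie", "awful bad film",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ('vect', TfidfVectorizer()),
    ('rf', MySearcher(
        estimator=RandomForestClassifier(random_state=1, n_estimators=10),
        param_grid={'fake': ['passes', 'check']},
        cv=2)),
])

# Outer search: varies the vectorizer, which changes the column count.
outer = GridSearchCV(pipe, {'vect__ngram_range': [(1, 1), (1, 2)]}, cv=2)
outer.fit(docs, labels)

# Outer level: best vectorizer setting.
print(outer.best_params_)

# Inner level: the refitted pipeline's last step is itself a fitted search.
inner = outer.best_estimator_.named_steps['rf']
print(inner.best_params_)      # best max_features for that vectorizer
print(inner.best_estimator_)   # the refitted random forest
```

Note that outer.cv_results_ only reports one score per ngram_range candidate; the per-max_features scores live in inner.cv_results_ and describe only the final refit on the full data.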