python pipeline random-forest tfidfvectorizer gridsearchcv

Pass object attribute from previous sklearn pipeline step as argument to next step method

tl;dr: Is there any way to call .get_feature_names() on the fitted vectorizer from the previous step of a pipeline and use the result as a hyperparameter in the next step?

I have a Pipeline that fits and transforms text data with TfidfVectorizer and then runs a RandomForestClassifier. I want to run GridSearchCV over a range of max_features values for the classifier, with the range based on the number of features the vectorizer produced from the text.

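import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

#setup pipeline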
pipe = Pipeline([
    ('vect', TfidfVectorizer(max_df=.4,
                            min_df=3,
                            norm='l1',
                            stop_words='english',
                            use_idf=False)),
    ('rf', RandomForestClassifier(random_state=1,
                                  criterion='entropy',
                                  n_estimators=800))
])

#setup parameter grid
params = {
    'rf__max_features': np.arange(1, len(vect.get_feature_names()),1)
}

Defining the parameter grid raises the following error:

NameError: name 'vect' is not defined

Edit:

This matters more (though the sample code doesn't show it) when I am also tuning a TfidfVectorizer parameter such as ngram_range, since that changes the number of features passed to the next step.

Solution

The parameter grid gets populated before anything in the pipeline is fitted, so you can't do this directly. You might be able to monkey-patch the gridsearch, like here, but I'd expect it to be substantially harder since your second parameter depends on the results of fitting the first step.
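To make the ordering problem concrete, here is a small sketch. The toy corpus `docs` is made up purely for illustration. It shows that the column count only exists once the vectorizer has been fitted on data, and that it changes with ngram_range, which is why a fixed grid can't know it in advance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration
docs = ["good great film", "great good movie", "bad awful film", "awful bad movie"]

n_columns = {}
for ngram_range in [(1, 1), (1, 2)]:
    vect = TfidfVectorizer(ngram_range=ngram_range, norm='l1', use_idf=False)
    # The number of columns is only known after fitting on actual data
    n_columns[ngram_range] = vect.fit_transform(docs).shape[1]

print(n_columns)
```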

I think the best approach, while it won't produce exactly what you're after, is to just use fractional values for max_features, i.e. a percentage of the columns coming out of the vectorizer.
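A minimal sketch of the fractional approach, assuming a made-up toy corpus (`docs`, `labels`) and a much smaller forest than in the question so it runs quickly. A float max_features is interpreted by the forest as a fraction of whatever columns it receives at fit time, so the same grid works for every ngram_range.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only for illustration
docs = ["good great film", "great good movie", "good film great", "great movie good",
        "bad awful film", "awful bad movie", "bad film awful", "awful movie bad"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ('vect', TfidfVectorizer(norm='l1', use_idf=False)),
    ('rf', RandomForestClassifier(random_state=1,
                                  criterion='entropy',
                                  n_estimators=10)),  # small, for speed
])

# Floats in (0, 1] mean "this fraction of the incoming columns"
params = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'rf__max_features': [0.1, 0.25, 0.5, 1.0],
}

search = GridSearchCV(pipe, params, cv=2)
search.fit(docs, labels)
print(search.best_params_)
```

The results stay in a single flat cv_results_ table, at the cost of not scoring every integer value.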

If you really want a score for every integer max_features, I think the easiest way may be to have two nested grid searches, the inner one only instantiating the parameter space when its fit is called:

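import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

estimator = RandomForestClassifier(
    random_state=1,
    criterion='entropy',
    n_estimators=800
)

class MySearcher(GridSearchCV):
    def fit(self, X, y=None, **fit_params):
        # X is the vectorizer's output here, so its column count is known
        m = X.shape[1]
        # np.arange excludes the stop value: this covers 1..m-1 (use m + 1 to include m)
        self.param_grid = {'max_features': np.arange(1, m, 1)}
        return super().fit(X, y, **fit_params)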

pipe = Pipeline([
    ('vect', TfidfVectorizer(max_df=.4,
                             min_df=3,
                             norm='l1',
                             stop_words='english',
                             use_idf=False)),
    ('rf', MySearcher(estimator=estimator, 
                      param_grid={'fake': ['passes', 'check']}))
])

Now the search results will be awkwardly nested: the best value of, say, ngram_range gives you a refitted copy of pipe, whose second step is itself a fitted search with its own best value of max_features and a corresponding refitted random forest. Also, each inner search only sees the training portion of an outer fold, which its own cross-validation then splits again, so the forests are tuned on somewhat less data.
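As a rough sketch of how the nested results could be read back, the example below wraps the pipeline in an outer GridSearchCV over ngram_range. The toy data (`docs`, `labels`), the small n_estimators, and cv=2 are assumptions chosen only so it runs quickly. It also uses m + 1 as the upper bound so that using every column is tried too.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class MySearcher(GridSearchCV):
    def fit(self, X, y=None, **fit_params):
        m = X.shape[1]
        # Build the real grid only now that the column count is known
        self.param_grid = {'max_features': np.arange(1, m + 1)}
        return super().fit(X, y, **fit_params)


# Hypothetical toy data, only for illustration
docs = ["good great film", "great good movie", "good film great", "great movie good",
        "bad awful film", "awful bad movie", "bad film awful", "awful movie bad"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ('vect', TfidfVectorizer(norm='l1', use_idf=False)),
    ('rf', MySearcher(
        estimator=RandomForestClassifier(random_state=1,
                                         criterion='entropy',
                                         n_estimators=5),  # small, for speed
        param_grid={'fake': ['passes', 'check']},  # placeholder, replaced in fit
        cv=2)),
])

# Outer search over the vectorizer; the inner search runs inside every outer fit
outer = GridSearchCV(pipe, {'vect__ngram_range': [(1, 1), (1, 2)]}, cv=2)
outer.fit(docs, labels)

# Outer level: best vectorizer setting
print(outer.best_params_)

# Inner level: the refitted pipeline's last step is itself a fitted search
inner = outer.best_estimator_.named_steps['rf']
print(inner.best_params_)

# The refitted random forest for that combination
forest = inner.best_estimator_
```

Note that inner.cv_results_ only describes the inner search from the final refit; the inner searches run during the outer folds are not kept.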