tl;dr: Is there any way to call .get_feature_names() on the fitted and transformed data from the previous step of a pipeline, and use the result as a hyperparameter in the next step of the pipeline?
I have a Pipeline that includes fitting and transforming text data with TfidfVectorizer, and then runs a RandomForestClassifier. I want to GridSearchCV across various values of max_features in the classifier, based on the number of features that the transformation produced from the text.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# setup pipeline
pipe = Pipeline([
    ('vect', TfidfVectorizer(max_df=.4,
                             min_df=3,
                             norm='l1',
                             stop_words='english',
                             use_idf=False)),
    ('rf', RandomForestClassifier(random_state=1,
                                  criterion='entropy',
                                  n_estimators=800))
])

# setup parameter grid
params = {
    'rf__max_features': np.arange(1, len(vect.get_feature_names()), 1)
}
This fails with:

NameError: name 'vect' is not defined
This would be more relevant (though it isn't shown in the sample code) if I were varying a parameter of the TfidfVectorizer such as ngram_range; one can see how that would change the number of features output to the next step...
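For instance, a quick sketch (with a made-up three-document corpus) shows the vocabulary growing as ngram_range widens:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog ate my homework",
        "the cat and the dog became friends"]

for ngrams in [(1, 1), (1, 2), (1, 3)]:
    vect = TfidfVectorizer(ngram_range=ngrams).fit(docs)
    # get_feature_names() was renamed get_feature_names_out() in scikit-learn 1.0
    print(ngrams, len(vect.get_feature_names()))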
The parameter grid gets populated before anything in the pipeline is fitted, so you can't do this directly. You might be able to monkey-patch the grid search, like here, but I'd expect that to be substantially harder in your case, since your second parameter depends on the results of fitting the first step.
I think the best approach, while it won't produce exactly what you're after, is just to use fractional values for max_features, i.e. a fraction of the columns coming out of the vectorizer.
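For instance, sticking with the pipe from your question (the grid values here are just an illustration):

import numpy as np
from sklearn.model_selection import GridSearchCV

# floats in (0, 1] are interpreted by RandomForestClassifier as a
# fraction of the feature count, so no knowledge of the vocabulary
# size is needed when the grid is built
params = {
    'rf__max_features': np.linspace(0.05, 1.0, 20),
}
search = GridSearchCV(pipe, params, cv=5)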
If you really want a score for every integer max_features, I think the easiest way may be to have two nested grid searches, with the inner one only instantiating its parameter space when its fit is called:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

estimator = RandomForestClassifier(
    random_state=1,
    criterion='entropy',
    n_estimators=800,
)

class MySearcher(GridSearchCV):
    def fit(self, X, y):
        # by the time this runs, the vectorizer has already done its
        # work, so the real feature count is just the column count of X
        m = X.shape[1]
        self.param_grid = {'max_features': np.arange(1, m, 1)}
        return super().fit(X, y)

pipe = Pipeline([
    ('vect', TfidfVectorizer(max_df=.4,
                             min_df=3,
                             norm='l1',
                             stop_words='english',
                             use_idf=False)),
    # the placeholder grid just satisfies the constructor's validation;
    # fit() replaces it with the real grid
    ('rf', MySearcher(estimator=estimator,
                      param_grid={'fake': ['passes', 'check']}))
])
Now the search results will be awkwardly nested: the best values of, say, ngram_range give you a refitted copy of pipe, whose second step will itself have a best value of max_features and a correspondingly refitted random forest. Also, the data available to the inner search will be a bit smaller, since it only ever sees the outer search's training folds.
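If you go this route, digging the nested results out looks something like the following (X_train, y_train, and the outer grid values here are placeholders):

from sklearn.model_selection import GridSearchCV

outer = GridSearchCV(pipe,
                     {'vect__ngram_range': [(1, 1), (1, 2)]},
                     cv=5)
outer.fit(X_train, y_train)

# best_estimator_ is a refitted copy of pipe; its 'rf' step is the
# inner search, carrying its own best max_features and refitted forest
inner = outer.best_estimator_.named_steps['rf']
print(outer.best_params_)   # e.g. the best ngram_range
print(inner.best_params_)   # the best max_features for that setting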