I am trying to learn how to work with text data through sklearn and am running into an issue that I cannot solve.
The tutorial I'm following is: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
The input is a pandas df with two columns. One with text, one with a binary class.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])
x_train = traindf['text']
x_test = traindf['text']
y_train = traindf['class']
y_test = testdf['class']
# CV
count_vect = CountVectorizer(stop_words='english')
x_train_modified = count_vect.fit_transform(x_train)
x_test_modified = count_vect.transform(x_test)
idf = TfidfTransformer()
fit = idf.fit(x_train_modified)
x_train_mod2 = fit.transform(x_train_modified)
mnb = MultinomialNB()
x_train_data = mnb.fit(x_train_mod2, y_train)
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
predicted = text_clf.predict(x_test_modified)
When I try to run the last line:
NotFittedError Traceback (most recent call last)
<ipython-input-64-8815003b4713> in <module>()
----> 1 predicted = text_clf.predict(x_test_modified)
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
114 # lambda, but not partial, allows help() to work with update_wrapper
--> 115 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
116 # update the docstring of the returned function
117 update_wrapper(out, self.fn)
~/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in predict(self, X)
304 for name, transform in self.steps[:-1]:
305 if transform is not None:
--> 306 Xt = transform.transform(Xt)
307 return self.steps[-1][-1].predict(Xt)
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
918 self._validate_vocabulary()
--> 920 self._check_vocabulary()
922 # use the same matrix-building strategy as fit_transform
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _check_vocabulary(self)
301 """Check if vocabulary is empty or missing (not fit-ed)"""
302 msg = "%(name)s - Vocabulary wasn't fitted."
--> 303 check_is_fitted(self, 'vocabulary_', msg=msg),
305 if len(self.vocabulary_) == 0:
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
767 if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768 raise NotFittedError(msg % {'name': type(estimator).__name__})
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
Any suggestions on how to fix this error? I am properly transforming the CV model on the test data. I even checked if the vocabulary list was empty and it isn't (count_vect.vocabulary_)
There are several issues with your question.
For starters, you never actually fit the pipeline, hence the error. If you look more closely at the linked tutorial, you'll see that there is a text_clf.fit step (where text_clf is indeed the pipeline).
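To make the first point concrete, here is a minimal sketch on a hypothetical toy corpus (the docs and labels lists are made up for illustration and stand in for the question's DataFrame columns). It suggests that fitting a separate CountVectorizer does nothing for the one inside the pipeline, which stays unfitted until text_clf.fit is called:

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the asker's data
docs = ["cheap pills online", "project meeting tomorrow",
        "win cheap prizes", "quarterly project report"]
labels = [1, 0, 1, 0]


def build_pipeline():
    """Return a fresh, unfitted pipeline with the same steps as in the question."""
    return Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])


# Fitting a standalone vectorizer has no effect on the pipeline's own 'vect' step
standalone_vect = CountVectorizer().fit(docs)

unfitted = build_pipeline()
try:
    unfitted.predict(["cheap project"])
    raised = False
except NotFittedError:
    raised = True  # same error class as in the question's traceback

# After fitting the pipeline itself, predict works on raw text
fitted = build_pipeline().fit(docs, labels)
prediction = fitted.predict(["cheap pills"])
```

The pipeline builds its own CountVectorizer instance, so its vocabulary is only learned when the pipeline's fit method runs.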
Second, you don't use the pipeline as intended. Its whole purpose is to fit everything end to end; instead, you fit its individual components one by one. If you check the tutorial again, you'll see that the code for fitting the pipeline:
text_clf.fit(twenty_train.data, twenty_train.target)
uses the data in its initial form, not the intermediate transformations, as you do. The point of the tutorial is to demonstrate how the individual transformations can be wrapped up in (and replaced by) a pipeline, not to use the pipeline on top of these transformations.
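As a rough illustration of "wrapped up in (and replaced by)", the sketch below runs the manual step-by-step route and the pipeline route on the same hypothetical toy data (the lists are invented for the example) and checks that both give the same predictions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the asker's data
x_train = ["cheap pills online", "project meeting tomorrow",
           "win cheap prizes", "quarterly project report"]
y_train = [1, 0, 1, 0]
x_test = ["cheap prizes online", "project report tomorrow"]

# Manual route: fit each component on the training data, then reuse it on the test data
count_vect = CountVectorizer(stop_words='english')
tfidf = TfidfTransformer()
mnb = MultinomialNB()
train_counts = count_vect.fit_transform(x_train)
train_tfidf = tfidf.fit_transform(train_counts)
mnb.fit(train_tfidf, y_train)
manual_pred = mnb.predict(tfidf.transform(count_vect.transform(x_test)))

# Pipeline route: the same three steps, fitted and applied in one call each
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(x_train, y_train)
pipeline_pred = text_clf.predict(x_test)

same = np.array_equal(manual_pred, pipeline_pred)
```

Note that the pipeline receives the raw strings in both fit and predict; passing it an already vectorized matrix, as in the question, applies the vectorizer twice.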
Third, you should avoid naming a variable fit. It is the name of the method that every scikit-learn estimator exposes, so using it for a variable invites confusion. Similarly, we don't use CV to abbreviate CountVectorizer (in ML lingo, CV stands for cross-validation).
That said, here is the correct way to use your pipeline:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# nlp_df is the DataFrame from the question, with 'text' and 'class' columns
traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])
x_train = traindf['text']
x_test = testdf['text']  # test features must come from testdf, not traindf
y_train = traindf['class']
y_test = testdf['class']
# The pipeline takes the raw text; counting and tf-idf weighting happen inside it
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(x_train, y_train)
predicted = text_clf.predict(x_test)
As you can see, the purpose of pipelines is to make things simpler (compared to using the components one by one sequentially), not to complicate them further.
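Once the pipeline is fitted, checking it on the held-out split is a one-liner. The following is only a sketch on hypothetical toy lists (invented here, with random_state and test_size chosen arbitrarily so the example is reproducible); with the question's data you would pass x_test and y_test from the DataFrame split instead:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the asker's data
docs = ["cheap pills online", "win cheap prizes", "free prizes online",
        "cheap free pills", "project meeting tomorrow",
        "quarterly project report", "meeting report tomorrow",
        "quarterly meeting agenda"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Stratified split on the raw text, mirroring the split in the question
x_train, x_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0)

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(x_train, y_train)

predicted = text_clf.predict(x_test)
accuracy = accuracy_score(y_test, predicted)

# Pipeline.score delegates to the classifier's score, which is mean accuracy
pipeline_score = text_clf.score(x_test, y_test)
```

Because the pipeline is a single estimator, the same object can also be handed to utilities such as cross_val_score without any extra wiring.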