I am trying to learn how to work with text data through sklearn and am running into an issue that I cannot solve.
The tutorial I'm following is: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
The input is a pandas df with two columns. One with text, one with a binary class.
Code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])
x_train = traindf['text']
x_test = testdf['text']
y_train = traindf['class']
y_test = testdf['class']
# CV
count_vect = CountVectorizer(stop_words='english')
x_train_modified = count_vect.fit_transform(x_train)
x_test_modified = count_vect.transform(x_test)
# TF-IDF
idf = TfidfTransformer()
fit = idf.fit(x_train_modified)
x_train_mod2 = fit.transform(x_train_modified)
# MNB
mnb = MultinomialNB()
x_train_data = mnb.fit(x_train_mod2, y_train)
text_clf = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB()),
])
predicted = text_clf.predict(x_test_modified)
When I try to run the last line, I get the following error:
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
<ipython-input-64-8815003b4713> in <module>()
----> 1 predicted = text_clf.predict(x_test_modified)
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
113
114 # lambda, but not partial, allows help() to work with update_wrapper
--> 115 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
116 # update the docstring of the returned function
117 update_wrapper(out, self.fn)
~/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in predict(self, X)
304 for name, transform in self.steps[:-1]:
305 if transform is not None:
--> 306 Xt = transform.transform(Xt)
307 return self.steps[-1][-1].predict(Xt)
308
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
918 self._validate_vocabulary()
919
--> 920 self._check_vocabulary()
921
922 # use the same matrix-building strategy as fit_transform
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _check_vocabulary(self)
301 """Check if vocabulary is empty or missing (not fit-ed)"""
302 msg = "%(name)s - Vocabulary wasn't fitted."
--> 303 check_is_fitted(self, 'vocabulary_', msg=msg),
304
305 if len(self.vocabulary_) == 0:
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
766
767 if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768 raise NotFittedError(msg % {'name': type(estimator).__name__})
769
770
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
Any suggestions on how to fix this error? As far as I can tell, I am transforming the test data with the fitted CountVectorizer, and I even checked that its vocabulary isn't empty (count_vect.vocabulary_).
Thank you!
There are several issues with your question.
For starters, you don't actually fit the pipeline, hence the error. Looking more closely at the linked tutorial, you'll see that there is a text_clf.fit step (where text_clf is indeed the pipeline).
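To make the cause of the error concrete, here is a minimal sketch (the toy documents are made up for illustration): fitting your standalone count_vect does nothing to the separate CountVectorizer instance that lives inside the pipeline, and it is that inner, unfitted instance that predict complains about:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["good movie", "bad film"]  # toy documents, made up for illustration

count_vect = CountVectorizer()
count_vect.fit(docs)  # fits only this standalone vectorizer

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])

# the vectorizer inside the pipeline is a different, still-unfitted object:
print(hasattr(count_vect, 'vocabulary_'))                # True
print(hasattr(pipe.named_steps['vect'], 'vocabulary_'))  # False
```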
Second, you don't use the pipeline correctly: the whole point of a pipeline is to fit the entire sequence end-to-end, whereas you fit its individual components one by one. If you check the tutorial again, you'll see that the code for the pipeline fit:
text_clf.fit(twenty_train.data, twenty_train.target)
uses the data in their initial form, not their intermediate transformations, as you do; the point of the tutorial is to demonstrate how the individual transformations can be wrapped up in (and replaced by) a pipeline, not to use a pipeline on top of those transformations.
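For what it's worth, this equivalence can be checked directly; in the sketch below (toy documents and labels, made up for illustration), the manually chained transformations and the pipeline fitted end-to-end on the raw documents give identical predictions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["good movie", "bad movie", "good film", "bad film"]  # toy data
y = [1, 0, 1, 0]

# manual chain, fitted step by step
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = MultinomialNB()
x_counts = vect.fit_transform(docs)
x_tfidf = tfidf.fit_transform(x_counts)
clf.fit(x_tfidf, y)
manual = clf.predict(tfidf.transform(vect.transform(docs)))

# equivalent pipeline, fitted end-to-end on the raw documents
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])
pipe.fit(docs, y)

print(np.array_equal(manual, pipe.predict(docs)))  # True
```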
Third, you should avoid naming variables fit; although it is not technically a reserved keyword, it shadows the method name that every sklearn estimator exposes and is bound to cause confusion. Similarly, we don't use CV to abbreviate CountVectorizer (in ML lingo, CV stands for cross-validation).
That said, here is the correct way for using your pipeline:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])
x_train = traindf['text']
x_test = testdf['text']
y_train = traindf['class']
y_test = testdf['class']
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB()),
])
text_clf.fit(x_train, y_train)
predicted = text_clf.predict(x_test)
As you can see, the purpose of the pipelines is to make things simpler (compared to using the components one by one sequentially), not to complicate them further...
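As a side note, once the pipeline is fitted you can evaluate it like any other estimator, e.g. with accuracy_score from sklearn.metrics; in this self-contained sketch, the toy documents and labels merely stand in for your nlp_df columns:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy data standing in for nlp_df['text'] and nlp_df['class']
x_train = ["great plot and acting", "terrible boring film",
           "wonderful cast", "awful script"]
y_train = [1, 0, 1, 0]
x_test = ["great cast", "boring script"]
y_test = [1, 0]

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(x_train, y_train)
predicted = text_clf.predict(x_test)
print(accuracy_score(y_test, predicted))
```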