I am trying to learn how to work with text data through sklearn and am running into an issue that I cannot solve.
The tutorial I'm following is: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
The input is a pandas df with two columns. One with text, one with a binary class.
Code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])
x_train = traindf['text']
x_test = testdf['text']
y_train = traindf['class']
y_test = testdf['class']
# CV
count_vect = CountVectorizer(stop_words='english')
x_train_modified = count_vect.fit_transform(x_train)
x_test_modified = count_vect.transform(x_test)
# TF-IDF
idf = TfidfTransformer()
fit = idf.fit(x_train_modified)
x_train_mod2 = fit.transform(x_train_modified)
# MNB
mnb = MultinomialNB()
x_train_data = mnb.fit(x_train_mod2, y_train)
text_clf = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB()),
])
predicted = text_clf.predict(x_test_modified)
When I try to run the last line, I get the following error:
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
<ipython-input-64-8815003b4713> in <module>()
----> 1 predicted = text_clf.predict(x_test_modified)
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
113
114 # lambda, but not partial, allows help() to work with update_wrapper
--> 115 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
116 # update the docstring of the returned function
117 update_wrapper(out, self.fn)
~/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in predict(self, X)
304 for name, transform in self.steps[:-1]:
305 if transform is not None:
--> 306 Xt = transform.transform(Xt)
307 return self.steps[-1][-1].predict(Xt)
308
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
918 self._validate_vocabulary()
919
--> 920 self._check_vocabulary()
921
922 # use the same matrix-building strategy as fit_transform
~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _check_vocabulary(self)
301 """Check if vocabulary is empty or missing (not fit-ed)"""
302 msg = "%(name)s - Vocabulary wasn't fitted."
--> 303 check_is_fitted(self, 'vocabulary_', msg=msg),
304
305 if len(self.vocabulary_) == 0:
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
766
767 if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768 raise NotFittedError(msg % {'name': type(estimator).__name__})
769
770
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
Any suggestions on how to fix this error? As far as I can tell, I am transforming the test data with the fitted CountVectorizer, and I even checked that its vocabulary isn't empty (count_vect.vocabulary_).
Thank you!
There are several issues with your question.
For starters, you don't actually fit the pipeline, hence the error. Looking more closely at the linked tutorial, you'll see that there is a text_clf.fit step (where text_clf is indeed the pipeline).
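To make the cause of the error concrete, here is a minimal sketch (the toy documents are made up for illustration): fitting your standalone count_vect does nothing to the separate CountVectorizer instance that lives inside the pipeline, and it is that inner, unfitted instance that predict complains about:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["good movie", "bad film"]  # toy documents, made up for illustration

count_vect = CountVectorizer()
count_vect.fit(docs)  # fits only this standalone vectorizer

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])

# the vectorizer inside the pipeline is a different, still-unfitted object:
print(hasattr(count_vect, 'vocabulary_'))                # True
print(hasattr(pipe.named_steps['vect'], 'vocabulary_'))  # False
```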
Second, you don't use the pipeline correctly: the whole point of a pipeline is to fit the entire sequence end-to-end, whereas you fit its individual components one by one. If you check the tutorial again, you'll see that the code for the pipeline fit:
text_clf.fit(twenty_train.data, twenty_train.target)
uses the data in their initial form, not their intermediate transformations, as you do; the point of the tutorial is to demonstrate how the individual transformations can be wrapped up in (and replaced by) a pipeline, not to use a pipeline on top of those transformations.
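For what it's worth, this equivalence can be checked directly; in the sketch below (toy documents and labels, made up for illustration), the manually chained transformations and the pipeline fitted end-to-end on the raw documents give identical predictions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["good movie", "bad movie", "good film", "bad film"]  # toy data
y = [1, 0, 1, 0]

# manual chain, fitted step by step
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = MultinomialNB()
x_counts = vect.fit_transform(docs)
x_tfidf = tfidf.fit_transform(x_counts)
clf.fit(x_tfidf, y)
manual = clf.predict(tfidf.transform(vect.transform(docs)))

# equivalent pipeline, fitted end-to-end on the raw documents
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])
pipe.fit(docs, y)

print(np.array_equal(manual, pipe.predict(docs)))  # True
```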
Third, you should avoid naming variables fit; although it is not technically a reserved keyword, it shadows the method name that every sklearn estimator exposes and is bound to cause confusion. Similarly, we don't use CV to abbreviate CountVectorizer (in ML lingo, CV stands for cross-validation).
That said, here is the correct way for using your pipeline:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])
x_train = traindf['text']
x_test = testdf['text']
y_train = traindf['class']
y_test = testdf['class']
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB()),
])
text_clf.fit(x_train, y_train)
predicted = text_clf.predict(x_test)
As you can see, the purpose of the pipelines is to make things simpler (compared to using the components one by one sequentially), not to complicate them further...
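As a side note, once the pipeline is fitted you can evaluate it like any other estimator, e.g. with accuracy_score from sklearn.metrics; in this self-contained sketch, the toy documents and labels merely stand in for your nlp_df columns:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy data standing in for nlp_df['text'] and nlp_df['class']
x_train = ["great plot and acting", "terrible boring film",
           "wonderful cast", "awful script"]
y_train = [1, 0, 1, 0]
x_test = ["great cast", "boring script"]
y_test = [1, 0]

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(x_train, y_train)
predicted = text_clf.predict(x_test)
print(accuracy_score(y_test, predicted))
```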