Sklearn NotFittedError for CountVectorizer in pipeline

I am trying to learn how to work with text data through sklearn and am running into an issue that I cannot solve.

The tutorial I'm following is:

The input is a pandas df with two columns. One with text, one with a binary class.


from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])

x_train = traindf['text']
x_test = testdf['text']
y_train = traindf['class']
y_test = testdf['class']

# CV
count_vect = CountVectorizer(stop_words='english')
x_train_modified = count_vect.fit_transform(x_train)
x_test_modified = count_vect.transform(x_test)

idf = TfidfTransformer()
fit = idf.fit(x_train_modified)
x_train_mod2 = fit.transform(x_train_modified)


mnb = MultinomialNB()
x_train_data = mnb.fit(x_train_mod2, y_train)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])

predicted = text_clf.predict(x_test_modified)

When I try to run the last line, I get a NotFittedError for the CountVectorizer.

Any suggestions on how to fix this error? I am properly transforming the CV model on the test data. I even checked if the vocabulary list was empty and it isn't (count_vect.vocabulary_)

Thank you!


  • There are several issues with your question.

    For starters, you never actually fit the pipeline, hence the error. Looking more closely at the linked tutorial, you'll see that it includes a step that fits text_clf (where text_clf is indeed the pipeline).
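
    A minimal sketch of this first point, using a tiny made-up corpus (the texts and labels below are invented for illustration and are not your data): calling predict on an unfitted pipeline raises NotFittedError, and the same call succeeds once fit has been run.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Predicting before fitting: no step of the pipeline has learned anything yet
try:
    text_clf.predict(["good film"])
    raised = False
except NotFittedError:
    raised = True

# Fitting the pipeline fits every step in order, on the raw text
text_clf.fit(texts, labels)
predicted = text_clf.predict(["good film", "bad film"])
```

    Note that fitting a separate CountVectorizer beforehand does not help here, because the pipeline holds its own, still unfitted, CountVectorizer instance.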

    Second, you don't use the notion of the pipeline correctly, which is exactly to fit the whole thing end to end; instead, you fit its individual components one by one. If you check the tutorial again, you'll see that the code for the pipeline fit uses the data in their initial form, not their intermediate transformations, as you do. The point of the tutorial is to demonstrate how the individual transformations can be wrapped up in (and replaced by) a pipeline, not to use the pipeline on top of these transformations.
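
    To illustrate, here is a small sketch comparing the two approaches on invented toy data (train_texts, test_texts and the labels are assumptions for the example, not taken from the tutorial or from your dataset). Both routes give the same predictions; the pipeline simply takes care of the intermediate transformations for you.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only
train_texts = ["good movie", "great film", "bad movie", "awful film"]
train_labels = [1, 1, 0, 0]
test_texts = ["good film", "bad film"]

# Manual chain: every step is fitted and applied by hand
count_vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = MultinomialNB()
train_counts = count_vect.fit_transform(train_texts)
train_tfidf = tfidf.fit_transform(train_counts)
clf.fit(train_tfidf, train_labels)
test_tfidf = tfidf.transform(count_vect.transform(test_texts))
manual_pred = clf.predict(test_tfidf)

# Pipeline: the same steps, fed with the raw text only
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

same = np.array_equal(manual_pred, pipe_pred)
```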

    Third, you should avoid naming a variable fit: it is the name of the method that every scikit-learn estimator exposes, so reusing it for a variable makes the code confusing. Similarly, we don't use CV to abbreviate Count Vectorizer (in ML lingo, CV stands for cross-validation).

    That said, here is the correct way for using your pipeline:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    traindf, testdf = train_test_split(nlp_df, stratify=nlp_df['class'])

    # Raw text and labels; no manual vectorizing is needed
    x_train = traindf['text']
    x_test = testdf['text']
    y_train = traindf['class']
    y_test = testdf['class']

    text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB()),
                         ])

    # Fit the whole pipeline on the raw training text, then predict on the raw test text
    text_clf.fit(x_train, y_train)
    predicted = text_clf.predict(x_test)

    As you can see, the purpose of pipelines is to make things simpler (compared to using the components sequentially, one by one), not to complicate them further.
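
    If you still want to inspect the fitted vocabulary, as you did with count_vect.vocabulary_, the fitted steps remain reachable through the pipeline's named_steps attribute. A short sketch, again on invented toy data rather than your DataFrame:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Before fitting, the vectorizer inside the pipeline has no vocabulary_ attribute
had_vocab_before = hasattr(text_clf.named_steps["vect"], "vocabulary_")

text_clf.fit(texts, labels)

# After fitting, the learned vocabulary is a dict mapping each token to a column index
vocab = text_clf.named_steps["vect"].vocabulary_
print(sorted(vocab))
```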