I have already trained a model for topic classification. When I try to transform new data into vectors for prediction, it fails with "NotFittedError: CountVectorizer - Vocabulary wasn't fitted." However, prediction works when I split the training data into training and test sets and predict on the test part with the trained model. Here is the code:
from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
# read new dataset
testdf = pd.read_csv('C://Users/KW198/Documents/topic_model/training_data/testdata.csv', encoding='cp950')
testdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 2 columns):
keywords 1800 non-null object
topics 1800 non-null int64
dtypes: int64(1), object(1)
memory usage: 28.2+ KB
# read columns
kw = testdf['keywords']
label = testdf['topics']
# convert the prediction data into vectors
vectorizer = CountVectorizer(min_df=1, stop_words='english')
x_testkw_vec = vectorizer.transform(kw)
Here is the error:
---------------------------------------------------------------------------
NotFittedError Traceback (most recent call last)
<ipython-input-93-cfcc7201e0f8> in <module>()
1 # convert the prediction data into vectors
2 vectorizer = CountVectorizer(min_df=1, stop_words='english')
----> 3 x_testkw_vec = vectorizer.transform(kw)
~\Anaconda3\envs\ztdl\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, raw_documents)
918 self._validate_vocabulary()
919
--> 920 self._check_vocabulary()
921
922 # use the same matrix-building strategy as fit_transform
~\Anaconda3\envs\ztdl\lib\site-packages\sklearn\feature_extraction\text.py in _check_vocabulary(self)
301 """Check if vocabulary is empty or missing (not fit-ed)"""
302 msg = "%(name)s - Vocabulary wasn't fitted."
--> 303 check_is_fitted(self, 'vocabulary_', msg=msg),
304
305 if len(self.vocabulary_) == 0:
~\Anaconda3\envs\ztdl\lib\site-packages\sklearn\utils\validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
766
767 if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768 raise NotFittedError(msg % {'name': type(estimator).__name__})
769
770
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
You need to call vectorizer.fit() so that the count vectorizer builds its vocabulary before you call vectorizer.transform(). You can also call vectorizer.fit_transform(), which combines both steps.
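As a minimal sketch of the difference (the two documents in `docs` are made up for illustration and are not from your dataset), an unfitted vectorizer raises the error, while fitting first, or using fit_transform(), works:

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, only for illustration
docs = ["machine learning model", "topic model training"]

# Calling transform() on a fresh vectorizer fails: no vocabulary exists yet
unfitted = CountVectorizer(min_df=1, stop_words='english')
try:
    unfitted.transform(docs)
    raised = False
except NotFittedError:
    raised = True

# Option 1: fit() builds the vocabulary, then transform() uses it
vectorizer = CountVectorizer(min_df=1, stop_words='english')
vectorizer.fit(docs)
x_two_steps = vectorizer.transform(docs)

# Option 2: fit_transform() does both in a single call
x_one_step = CountVectorizer(min_df=1, stop_words='english').fit_transform(docs)
```

Both options produce the same document-term matrix here, because they are fitted on the same documents.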
However, you should not create a new vectorizer for testing or any kind of inference. You need to use the same one you fitted when training the model; otherwise the results will be meaningless, because the vocabularies differ (some words are missing, the column indices do not line up with the features the model was trained on, and so on).
To do that, you can pickle the vectorizer used in training and load it at inference/test time.
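A rough sketch of that workflow, under some assumptions: the training keywords, the labels, the file name vectorizer.pkl, and the MultinomialNB classifier are all placeholders I made up, since your training code is not shown. In a real setup the classifier would be saved and loaded the same way as the vectorizer.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# --- Training time (toy data, purely illustrative) ---
train_kw = ["stock market price", "football match score",
            "market shares fall", "team wins match"]
train_labels = [0, 1, 0, 1]

vectorizer = CountVectorizer(min_df=1, stop_words='english')
x_train = vectorizer.fit_transform(train_kw)      # fit only on training data
clf = MultinomialNB().fit(x_train, train_labels)  # placeholder classifier

# Persist the fitted vectorizer (hypothetical path in a temp directory)
path = os.path.join(tempfile.mkdtemp(), "vectorizer.pkl")
with open(path, "wb") as f:
    pickle.dump(vectorizer, f)

# --- Inference time ---
with open(path, "rb") as f:
    loaded_vectorizer = pickle.load(f)

new_kw = ["market price today", "match score tonight"]
# transform() only: reuse the training vocabulary, never fit again here
x_new = loaded_vectorizer.transform(new_kw)
predictions = clf.predict(x_new)
```

Because the loaded vectorizer carries the training vocabulary, x_new has the same number of columns in the same order as x_train, which is what the model expects. Words that were not seen during training are simply ignored.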