I trained a classifier on a set of short documents for a binary classification task and pickled it after getting reasonable F1 and accuracy scores.
While training, I reduced the number of features using a scikit-learn CountVectorizer, cv:
cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000)
and then used the fit_transform() and transform() methods to obtain the transformed train and test sets:
transformed_feat_train = numpy.zeros((0,0,))
transformed_feat_test = numpy.zeros((0,0,))
transformed_feat_train = cv.fit_transform(trainingTextFeat).toarray()
transformed_feat_test = cv.transform(testingTextFeat).toarray()
This all worked fine for training and testing the classifier. However, I am not sure how to use fit_transform() and transform() with a pickled version of the trained classifier to predict the labels of unseen, unlabeled data.
I am extracting the features from the unlabeled data exactly the same way I did while training/testing the classifier:
## load the pickled classifier for labeling
pickledClassifier = joblib.load(pickledClassifierFile)
## transform data
cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000)
cv.fit_transform(NOT_SURE)
transformed_feat_unlabeled = numpy.zeros((0,0,))
transformed_feat_unlabeled = cv.transform(unlabeled_text_feat).toarray()
## predict label on unseen, unlabeled data
l_predLabel = pickledClassifier.predict(transformed_feat_unlabeled)
Error message:
Traceback (most recent call last):
File "../clf.py", line 615, in <module>
if __name__=="__main__": main()
File "../clf.py", line 579, in main
cv.fit_transform(pickledClassifierFile)
File "../sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "../sklearn/feature_extraction/text.py", line 727, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
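You should use the same fitted vectorizer instance for transforming the training data, the test data, and any new unlabeled data. Calling fit_transform() on a fresh CountVectorizer builds a different vocabulary, so the resulting features no longer line up with the ones the classifier was trained on. You can do that by creating a pipeline with the vectorizer + classifier, training the pipeline on the training set, and pickling the whole pipeline. Later, load the pickled pipeline and call predict on it.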
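A minimal sketch of that approach is below. The question does not say which classifier was used, so LogisticRegression is an assumption here. The toy texts, labels, and the pipeline.joblib file name are also hypothetical stand-ins for your own data and paths.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for trainingTextFeat and its labels.
train_texts = ["good movie", "great film", "bad movie", "terrible film"]
train_labels = [1, 1, 0, 0]

# The vectorizer and the classifier live in one object, so the vocabulary
# learned during fit() travels with the model.
pipeline = Pipeline([
    ("cv", CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=15000)),
    ("clf", LogisticRegression()),  # assumption: any scikit-learn classifier fits here
])
pipeline.fit(train_texts, train_labels)

# Persist the whole pipeline, not just the classifier.
# (hypothetical path; a temp directory keeps the sketch self-contained)
model_path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipeline, model_path)

# Later, possibly in another process: load and predict on raw text.
# No CountVectorizer is created or fitted here.
loaded = joblib.load(model_path)
unlabeled_texts = ["a great movie", "a terrible movie"]
predicted = loaded.predict(unlabeled_texts)
print(predicted)
```

The loaded pipeline takes raw strings, so the prediction code never calls fit_transform() or transform() itself. If you prefer to keep the classifier pickle as it is, pickling the fitted cv object alongside it and calling only cv.transform() on the new data should work the same way.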
See this related question: Bringing a classifier to production.