
How to use a pickled classifier with CountVectorizer.fit_transform() for labeling data


I trained a classifier on a set of short documents and pickled it after getting reasonable F1 and accuracy scores on a binary classification task.

While training, I reduced the number of features using a scikit-learn CountVectorizer, cv:

    cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000) 

and then used the fit_transform() and transform() methods to obtain the transformed train and test sets:

    transformed_feat_train = numpy.zeros((0,0,))
    transformed_feat_test = numpy.zeros((0,0,))

    transformed_feat_train = cv.fit_transform(trainingTextFeat).toarray()
    transformed_feat_test = cv.transform(testingTextFeat).toarray()

This all worked fine for training and testing the classifier. However, I am not sure how to use fit_transform() and transform() with a pickled version of the trained classifier for predicting the label of unseen, unlabeled data.

I am extracting the features on the unlabeled data exactly the same way I was doing while training/testing the classifier:

    ## load the pickled classifier for labeling
    pickledClassifier = joblib.load(pickledClassifierFile)

    ## transform data
    cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000)
    cv.fit_transform(NOT_SURE)

    transformed_feat_unlabeled = numpy.zeros((0,0,))
    transformed_feat_unlabeled = cv.transform(unlabeled_text_feat).toarray()

    ## predict label on unseen, unlabeled data
    l_predLabel = pickledClassifier.predict(transformed_feat_unlabeled)

Error message:

    Traceback (most recent call last):
      File "../clf.py", line 615, in <module>
        if __name__=="__main__": main()
      File "../clf.py", line 579, in main
        cv.fit_transform(pickledClassifierFile)
      File "../sklearn/feature_extraction/text.py", line 780, in fit_transform
        vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
      File "../sklearn/feature_extraction/text.py", line 727, in _count_vocab
        raise ValueError("empty vocabulary; perhaps the documents only"
    ValueError: empty vocabulary; perhaps the documents only contain stop words

Solution

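  • Use the same fitted vectorizer instance for the training data, the test data, and any unlabeled data you score later. The fitted vectorizer holds the vocabulary that maps each n-gram to a feature column, so a new CountVectorizer fitted on other input will not produce the columns the classifier was trained on. The traceback also shows fit_transform() being called on pickledClassifierFile, a file path rather than a collection of documents, which is a likely cause of the empty vocabulary error. The simplest fix is to build a pipeline of vectorizer + classifier, train the pipeline on the training set, and pickle the whole pipeline. Later, load the pickled pipeline and call predict() on the raw text.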

    See this related question: Bringing a classifier to production.
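
A minimal sketch of that pipeline approach follows. The toy documents, the LogisticRegression classifier, and the file path are assumptions made for illustration; the question does not say which classifier or data were used. The CountVectorizer settings are the ones from the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for trainingTextFeat and its labels (made up for illustration).
train_texts = [
    "great product works well",
    "really great value",
    "terrible quality broke fast",
    "awful and terrible service",
]
train_labels = [1, 1, 0, 0]

# The vectorizer and the classifier are fitted together and stored together,
# so the learned vocabulary always matches what the classifier expects.
pipeline = Pipeline([
    ("cv", CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=15000)),
    ("clf", LogisticRegression()),  # assumed classifier, not named in the question
])
pipeline.fit(train_texts, train_labels)

# Persist the whole pipeline, not just the classifier.
model_path = os.path.join(tempfile.mkdtemp(), "text_clf.joblib")  # hypothetical path
joblib.dump(pipeline, model_path)

# Later, in the labeling script: load the pipeline and predict on raw text.
# No fit_transform() call is needed here; the loaded vectorizer only transforms.
loaded = joblib.load(model_path)
unlabeled_texts = ["great value", "terrible service"]
predicted = loaded.predict(unlabeled_texts)
```

The pipeline passes the vectorizer's sparse output straight to the classifier, so the .toarray() calls are not needed in this sketch. If the classifier has already been pickled on its own, an alternative is to pickle the fitted cv object from the training script as well and call only cv.transform() on the unlabeled text.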