Tags: python, scikit-learn, nlp, sentiment-analysis, countvectorizer

NotFittedError: CountVectorizer - Vocabulary wasn't fitted. while performing sentiment analysis


I am performing sentiment analysis using this data -

http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

The dataset contains 25K training and 25K testing reviews (12.5K positive and 12.5K negative in each split). I keep getting -

NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

Code -

(Required libraries and Variable names are initialized separately)

To create training and testing data -

import glob
import os
import numpy as np
def load_texts_labels_from_folders(path, folders):
    texts,labels = [],[]
    for idx,label in enumerate(folders):
        for fname in glob.glob(os.path.join(path, label, '*.*')):
            # use a context manager so each file handle is closed after reading
            with open(fname, 'r', encoding="utf8") as f:
                texts.append(f.read())
            labels.append(idx)
    # stored as np.int8 to save space 
    return texts, np.array(labels).astype(np.int8)

trn,trn_y = load_texts_labels_from_folders(f'{PATH}train',names)
val,val_y = load_texts_labels_from_folders(f'{PATH}test',names)

len(trn),len(trn_y),len(val),len(val_y)

len(trn_y[trn_y==1]),len(val_y[val_y==1])

np.unique(trn_y)

Count Vectorization -

re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()

# create term-document matrix
veczr = CountVectorizer(tokenizer=tokenize)


trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

veczr = CountVectorizer(tokenizer=tokenize,ngram_range=(1,3), min_df=1,max_features=80000)
trn_term_doc
trn_term_doc[5] #83 stored elements
w0 = set([o.lower() for o in trn[5].split(' ')]); w0
len(w0)
vocab = loaded_vectorizer.get_feature_names()
print(len(vocab))
vocab[5000:5005]

Here I get the error -

NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

Solution

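  • vocab = loaded_vectorizer.get_feature_names()

    loaded_vectorizer is not defined anywhere in the code shown, so whatever object it refers to was never fitted here, and it's not surprising that its vocabulary is missing. The only vectorizer fitted in this code is the first veczr (via fit_transform), so the feature names should come from that instance.

    Also, why do you initialize veczr twice? The second instance is never fitted or used. Because it replaces the fitted one, calling get_feature_names() on veczr after that line would raise the same NotFittedError.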
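A minimal sketch of the point above, assuming a tiny made-up corpus (`docs`) in place of the IMDB reviews and a hypothetical helper `feature_names` that is not part of the original code. It fits one vectorizer, reads the vocabulary from that same instance, and shows that a freshly created instance raises NotFittedError. The helper exists only because `get_feature_names()` was removed in scikit-learn 1.2 in favour of `get_feature_names_out()`.

```python
import re
import string

from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus (hypothetical; the question uses the IMDB reviews).
docs = ["A great movie, truly great!", "A dull movie."]

# Same idea as the question's tokenizer, restricted to ASCII punctuation.
re_tok = re.compile(f"([{re.escape(string.punctuation)}])")


def tokenize(s):
    # Put spaces around punctuation, then split on whitespace.
    return re_tok.sub(r" \1 ", s).split()


def feature_names(vectorizer):
    # Hypothetical helper: prefer get_feature_names_out() (newer releases),
    # fall back to get_feature_names() on older scikit-learn versions.
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return vectorizer.get_feature_names()


veczr = CountVectorizer(tokenizer=tokenize)
term_doc = veczr.fit_transform(docs)  # fitting builds the vocabulary
vocab = feature_names(veczr)          # works: same, already fitted instance

# A new instance has no vocabulary until fit() or fit_transform() is called.
fresh = CountVectorizer(tokenizer=tokenize)
try:
    feature_names(fresh)
    raised = False
except NotFittedError:
    raised = True

print(len(vocab), raised)
```

If the second configuration (ngram_range, max_features) is the one that is wanted, it would need its own fit_transform call before its vocabulary is read.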