python scikit-learn naivebayes sklearn-pandas

Getting dimension mismatch error when i try to predict with naive bayes / Python

I've created a sentiment script and use Naive Bayes to classify the reviews. I trained and tested my model and saved it in a Pickle object. Now I would like to perform on a new dataset my prediction but I always get following error message

raise ValueError('dimension mismatch') ValueError: dimension mismatch

It pops up on this line:

preds = nb.predict(transformed_review)[0]

Can anyone tell me if I'm doing something wrong? I do not understand the error.

This is my Skript:

sno = SnowballStemmer("german")
stopwords = [word.decode('utf-8-sig') for word in stopwords.words('german')] 

ta_review_files = glob.glob('C:/users/Documents/review?*.CSV')
review_akt_doc = max(ta_review_files, key=os.path.getctime

ta_review = pd.read_csv(review_akt_doc) 
sentiment_de_class= ta_review

x = sentiment_de_class['REV']
y = sentiment_de_class['SENTIMENT']

def text_process(text):
    nopunc = [char for char in text.decode('utf8') if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    noDig = ''.join(filter(lambda x: not x.isdigit(), nopunc)) 

    ## stemming
    stemmi = u''.join(sno.stem(unicode(x)) for x in noDig)

    stop = [word for word in stemmi.split() if word.lower() not in stopwords]
    stop = ' '.join(stop)

    return [word for word in stemmi.split() if word.lower() not in stopwords]


######################
# Matrix
######################
bow_transformer = CountVectorizer(analyzer=text_process).fit(x)
x = bow_transformer.transform(x)

######################
# Train and test data
######################
x_train, x_test, y_train, y_test = train_test_split(x,y, random_state=101)


print 'starting training ..'

######################
## first use
######################
#nb = MultinomialNB().fit(x_train,y_train)
#file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'wb')
## dump information to that file
#pickle.dump(nb, file)

######################
## after train
######################
file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'rb')
nb = pickle.load(file)

predis = []
######################
# Classify 
######################
cols = ['SENTIMENT_CLASSIFY']

for sentiment in sentiment_de_class['REV']:
    transformed_review = bow_transformer.transform([sentiment])
    preds = nb.predict(transformed_review)[0]  ##right here I get the error
    predis.append(preds)

df = pd.DataFrame(predis, columns=cols)

Solution

You need to save the CountVectorizer object too just as you are saving the nb.

When you call

CountVectorizer(analyzer=text_process).fit(x)

you are re-training the CountVectorizer on new data, so the features (vocabulary) found by it will be different than at the training time and hence the saved nb which was trained on the earlier features complain about dimension mismatch.

Better to pickle them in different files, but if you want you can save them in same file.

To pickle both in same object:

file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'wb')
pickle.dump(bow_transformer, file)  <=== Add this
pickle.dump(nb, file)

To read both in next call:

file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'rb')
bow_transformer = pickle.load(file)
nb = pickle.load(file)

Please look at this answer for more detail: https://stackoverflow.com/a/15463472/3374996