I'm trying to use MultinomialNB, but I get ValueError: dimension mismatch.
I'm using DictVectorizer for the training data and LabelEncoder for the class.
This is my code:
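from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB as mnb
from sklearn.preprocessing import LabelEncoder

def create_token(inpt):
    # split the text into tokens on single spaces
    return inpt.split(' ')

def tok_freq(inpt):
    # count how often each token occurs in the text
    tok = {}
    for i in create_token(inpt):
        if i not in tok:
            tok[i] = 1
        else:
            tok[i] += 1
    return tok
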
training_data = []
for i in range(len(raw_data)):
    training_data.append((tok_freq(raw_data.iloc[i].text), raw_data.iloc[i].category))
#vectorization
X, y = list(zip(*training_data))
label = LabelEncoder()
vector = DictVectorizer(dtype=float, sparse=True)
X = vector.fit_transform(X)
y = label.fit_transform(y)
multinb = mnb()
multinb.fit(X,y)
#vectorization for testing set
Xz = tok_freq(sms)
testX = vector.fit_transform(Xz)
multinb.predict(testX)
Which part of my code is wrong? Thanks.
Change
testX = vector.fit_transform(Xz)
to:
testX = vector.transform(Xz)
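When you call fit() or fit_transform(), you are essentially training the vectorizer on the new data, which is not what you want. Refitting rebuilds the feature vocabulary from the test data alone, so testX ends up with a different number of columns than the matrix the classifier was trained on, hence the dimension mismatch. You only want to convert the test set in the same manner as the training set, so call only transform().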
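
To make the difference concrete, here is a minimal sketch on toy data. The token counts, labels, and variable names (train_counts, test_counts, clf, and so on) are made up for illustration and stand in for the question's raw_data and sms. It fits the vectorizer once on the training dicts, then compares the shape produced by refitting on the test dict with the shape produced by transform().

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: token-count dicts like those returned by tok_freq().
train_counts = [
    {"free": 1, "prize": 1, "now": 1},
    {"call": 1, "me": 1, "later": 1},
    {"win": 1, "free": 1, "cash": 1},
    {"see": 1, "you": 1, "later": 1},
]
train_labels = ["spam", "ham", "spam", "ham"]
test_counts = [{"free": 1, "cash": 1, "unseen": 1}]

vector = DictVectorizer(dtype=float, sparse=True)
label = LabelEncoder()

# Fit the vocabulary and the label mapping on the training data only.
X_train = vector.fit_transform(train_counts)
y_train = label.fit_transform(train_labels)

clf = MultinomialNB()
clf.fit(X_train, y_train)

# A vectorizer fitted on the test data alone learns only the test tokens,
# so its column count no longer matches what the classifier expects.
refit_shape = DictVectorizer(dtype=float, sparse=True).fit_transform(test_counts).shape

# transform() reuses the training vocabulary; tokens never seen during
# fitting (here "unseen") are silently ignored.
X_test = vector.transform(test_counts)
predicted = label.inverse_transform(clf.predict(X_test))

print(X_train.shape)  # (4, 10): ten distinct training tokens
print(refit_shape)    # (1, 3): only the three test tokens
print(X_test.shape)   # (1, 10): same columns as the training matrix
print(predicted)
```

Because X_test has the same number of columns as X_train, predict() runs without the dimension mismatch. The same fitted LabelEncoder is reused to map the predicted integer back to its class name.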