
Python sklearn MultinomialNB: Dimension mismatch using DictVectorizer


I'm trying to classify text with MultinomialNB, but I get ValueError: dimension mismatch.

I'm using DictVectorizer for the training data and LabelEncoder for the class labels.

This is my code:

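# Imports assumed from the names used below (mnb is taken to be an alias for MultinomialNB)
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB as mnb
from sklearn.preprocessing import LabelEncoder

def create_token(inpt):
    return inpt.split(' ')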

def tok_freq(inpt):
    tok = {}
    for i in create_token(inpt):
        if i not in tok:
            tok[i] = 1
        else:
            tok[i] += 1
    return tok

training_data = []
for i in range(len(raw_data)):
    # tok_freq is the token-counting helper defined above
    training_data.append((tok_freq(raw_data.iloc[i].text), raw_data.iloc[i].category))

#vectorization
X, y = list(zip(*training_data))
label = LabelEncoder()
vector = DictVectorizer(dtype=float, sparse=True)
X = vector.fit_transform(X)
y = label.fit_transform(y)
multinb = mnb()
multinb.fit(X,y)

#vectorization for testing set
Xz = tok_freq(sms)
testX = vector.fit_transform(Xz)

multinb.predict(testX)

Which part of my code is wrong? Thanks.


Solution

  • Change

    testX = vector.fit_transform(Xz)
    

    to:

    testX = vector.transform(Xz)
    

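    Calling fit() or fit_transform() trains the vectorizer again on the new data, so it learns a new token-to-column mapping from the test tokens alone. The resulting matrix then no longer has the same columns as the one MultinomialNB was trained on, which is what triggers the dimension mismatch. You want to convert the test set with the same mapping that was learned from the training set, so call only transform().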
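
The fix can be sketched as follows. This is a minimal example, not the asker's actual data: the token-count dicts stand in for the output of tok_freq, and the names train_dicts, train_labels, test_dicts and refit are made up for illustration. The test input is passed as a list of dicts, which is the documented input form for DictVectorizer.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Made-up token-count dicts standing in for tok_freq(...) output.
train_dicts = [
    {"free": 2, "prize": 1, "now": 1},
    {"meeting": 1, "at": 1, "noon": 1},
    {"win": 1, "free": 1, "cash": 1},
    {"lunch": 1, "at": 1, "noon": 1},
]
train_labels = ["spam", "ham", "spam", "ham"]

label = LabelEncoder()
vector = DictVectorizer(dtype=float, sparse=True)

# fit_transform learns the token -> column mapping from the training data only.
X = vector.fit_transform(train_dicts)
y = label.fit_transform(train_labels)

multinb = MultinomialNB()
multinb.fit(X, y)

# A new message; the token "today" never appeared in the training data.
test_dicts = [{"free": 1, "cash": 1, "today": 1}]

# transform reuses the mapping learned above, so testX has the same number
# of columns as X. Tokens unseen during fit are ignored.
testX = vector.transform(test_dicts)
predicted = label.inverse_transform(multinb.predict(testX))

# For contrast: fitting a separate vectorizer on the test data alone yields
# a matrix with only as many columns as the test message has distinct tokens.
refit = DictVectorizer(dtype=float, sparse=True).fit_transform(test_dicts)

print("train:", X.shape, "test:", testX.shape, "refit on test:", refit.shape)
print("predicted:", list(predicted))
```

Here X and testX share the same column count because both come from the one fitted vectorizer, while refit has a different width. Passing a matrix like refit to predict() is the situation that produces the mismatch in the question.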