Tags: scikit-learn, nlp, classification, prediction, naive-bayes

How to get predictions for new data from MultinomialNB?


I'm venturing into a new topic and experimenting with categorising product names. Even without deeper knowledge, a (superficial) use of MultinomialNB already yielded quite good results for my use case.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


df = pd.DataFrame({
    'title':['short shirt', 'long shirt','green shoe','cool sneaker','heavy ballerinas'],
    'label':['shirt','shirt','shoe','shoe','shoe']
})

count_vec = CountVectorizer()
bow = count_vec.fit_transform(df['title'])
bow = bow.toarray()

X = bow
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
model = MultinomialNB().fit(X_train, y_train)

model.predict(X_test)

Based on the training in the simplified example above, I would like to categorise completely new titles and output them with their predicted labels:

new = pd.DataFrame({
    'title':['long top', 'super shirt','white shoe','super cool sneaker','perfect fit ballerinas'],
    'label': np.nan
})

Unfortunately, I am not sure of the next steps and would hope for some support.

...
count_vec = CountVectorizer()
bow = count_vec.fit_transform(new['title'])
bow = np.array(bow.todense())
model.predict(bow)

Solution

  • It's a mistake to fit CountVectorizer on the whole dataset, because the test set should not be used at all during training. This discipline not only follows proper ML principles (it prevents data leakage), it also avoids a practical problem: when the test set is prepared together with the training set, it becomes confusing to apply the model to another test set.

    The clean way to proceed is to always split the data into a training and a test set first; this forces one to transform the test set independently, using only what was learned from the training set. Applying the model to yet another test set is then straightforward.

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn import preprocessing
    
    df = pd.DataFrame({
        'title':['short shirt', 'long shirt','green shoe','cool sneaker','heavy ballerinas'],
        'label':['shirt','shirt','shoe','shoe','shoe']
    })
    
    X = df['title']
    y = df['label']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
    
    # 1) Training: use only training set!
    
    # labels should be encoded
    le = preprocessing.LabelEncoder()
    y_train_enc = le.fit_transform(y_train)
    
    count_vec = CountVectorizer()
    X_train_bow = count_vec.fit_transform(X_train)
    X_train_bow = X_train_bow.toarray()
    
    model = MultinomialNB().fit(X_train_bow, y_train_enc)
    
    
    
    # 2) Testing: apply the transformation fitted on the training set, then predict
    X_test_bow = count_vec.transform(X_test)
    X_test_bow = X_test_bow.toarray()
    y_test_enc = model.predict(X_test_bow)
    
    print("Predicted labels test set 1:", le.inverse_transform(y_test_enc))
    
    
    
    # 3) apply to another dataset = just another step of testing, same as above
    new = pd.DataFrame({
        'title':['long top', 'super shirt','white shoe','super cool sneaker','perfect fit ballerinas'],
        'label': np.nan
    })
    X_new = new['title']
    X_new_bow = count_vec.transform(X_new)
    X_new_bow = X_new_bow.toarray()
    y_new_enc = model.predict(X_new_bow)
    
    print("Predicted labels test set 2:", le.inverse_transform(y_new_enc))
    

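As a side note, the same fit/transform discipline can be enforced automatically by bundling the vectorizer and the classifier into a scikit-learn Pipeline: whatever data goes into `.fit()` is the only data the vectorizer is fitted on, and `.predict()` only ever calls `.transform()`. A minimal sketch (for brevity it fits on the full toy frame rather than a split; note that MultinomialNB accepts sparse input and string labels directly, so LabelEncoder and the dense conversion become optional):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

df = pd.DataFrame({
    'title': ['short shirt', 'long shirt', 'green shoe',
              'cool sneaker', 'heavy ballerinas'],
    'label': ['shirt', 'shirt', 'shoe', 'shoe', 'shoe']
})

# The pipeline fits CountVectorizer only on the data passed to .fit(),
# and predict() reuses that fitted vocabulary -- no leakage possible.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(df['title'], df['label'])

new_titles = ['long top', 'super shirt', 'white shoe',
              'super cool sneaker', 'perfect fit ballerinas']
print(pipe.predict(new_titles))
```

Unseen words such as 'super' or 'white' are simply dropped by the fitted vectorizer, which is exactly the behaviour one wants when scoring new data.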
    Notes:

    • This point is not specific to MultinomialNB; it is the correct method for any classifier.
    • With real data it's often a good idea to use the min_df argument of CountVectorizer: rare words inflate the number of features, don't help predict the label, and cause overfitting.
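To illustrate the min_df note on the toy titles (the threshold min_df=2 is chosen arbitrarily here): only 'shirt' occurs in at least two titles, so the vocabulary collapses from 9 features to 1.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

titles = pd.Series(['short shirt', 'long shirt', 'green shoe',
                    'cool sneaker', 'heavy ballerinas'])

# Default: every word becomes a feature, however rare.
full = CountVectorizer()
full.fit(titles)
print(len(full.vocabulary_))        # 9 distinct words

# min_df=2: keep only words appearing in at least 2 documents.
pruned = CountVectorizer(min_df=2)
pruned.fit(titles)
print(sorted(pruned.vocabulary_))   # ['shirt']
```

On a real corpus the effect is far more dramatic, since natural-language vocabularies are dominated by words that occur only once or twice.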