
Is there any way to extract the Maximum A Posteriori probabilities from scikit-learn's Multinomial Naive Bayes, as in the Stanford NLP research paper?

I'm trying to replicate the results of the paper in the link.

The link explains how Multinomial Naive Bayes works for text classification.

I've tried to reproduce the example using scikit-learn.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

dftrain = pd.DataFrame(data=np.array([["Chinese Beijing Chinese", "Chinese Chinese Shanghai", "Chinese Macao", "Tokyo Japan Chinese"], 
["yes", "yes", "yes", "no"]]))

dftrain = dftrain.T
dftrain.columns = ['text', 'label']

dftest = pd.DataFrame(data=np.array([["Chinese Chinese Chinese Tokyo Japan"]]))
dftest.columns = ['text']

count_vectorizer = CountVectorizer(token_pattern=r"\b\w+\b", stop_words=None)
count_train = count_vectorizer.fit_transform(dftrain['text'])
count_test = count_vectorizer.transform(dftest['text'])

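# Fit on the training counts and labels, then predict the test document
clf = MultinomialNB()
clf.fit(count_train, dftrain['label'])
print(clf.predict(count_test))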

The output is correctly printed as:

['yes']

Just as in the paper! The paper predicts yes because

P(yes | test set) = 0.0003 > P(no | test set) = 0.0001

I want to be able to see those two probabilities!

When I type:

clf.predict_proba(count_test)

I get

array([[ 0.31024139,  0.68975861]])

I think what this means is:

P(test belongs to label 'no') = 0.31024139 and P(test belongs to label 'yes') = 0.68975861

Therefore, scikit-learn also predicts that the text belongs to the label yes, but with different numbers.

My question is: why are the probabilities different? The paper gives P(yes | test set) = 0.0003 > P(no | test set) = 0.0001, but instead of 0.0003 and 0.0001 I see 0.68975861 and 0.31024139.

Am I missing something here? Does this have something to do with the class_prior parameter?

I did read the documentation!

Apparently, the parameter is estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting.
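
For reference, here is a minimal sketch of that smoothed relative-frequency estimate, worked by hand on the training documents above. It assumes add-one (Laplace) smoothing, which matches MultinomialNB's default alpha=1.0; the helper name smoothed_likelihood is hypothetical and only for illustration.

```python
from collections import Counter
from fractions import Fraction

# Training documents and labels from the question.
train = [
    ("Chinese Beijing Chinese", "yes"),
    ("Chinese Chinese Shanghai", "yes"),
    ("Chinese Macao", "yes"),
    ("Tokyo Japan Chinese", "no"),
]

# Vocabulary: the six distinct words across all training documents.
vocab = {word for text, _ in train for word in text.split()}


def smoothed_likelihood(word, label, alpha=1):
    """Hypothetical helper: P(word | label) with add-alpha smoothing."""
    counts = Counter(
        w for text, lab in train if lab == label for w in text.split()
    )
    total = sum(counts.values())
    return Fraction(counts[word] + alpha, total + alpha * len(vocab))


print(smoothed_likelihood("Chinese", "yes"))  # 3/7
print(smoothed_likelihood("Tokyo", "yes"))    # 1/14
print(smoothed_likelihood("Chinese", "no"))   # 2/9
```

These are the per-word factors that get multiplied together, along with the class prior, to produce the paper's 0.0003 and 0.0001.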

What I'm wondering is: is there any way I can replicate the results and see the same numbers as in the research paper?


  • This has more to do with the meaning of the probabilities predict_proba produces. The numbers 0.0003 and 0.0001 are not normalised, i.e. they don't sum to one, whereas the values returned by predict_proba do. If you normalise the paper's values, you'll get the same result.

    See the snippet below:

    In [63]: clf.predict_proba(count_test)
    Out[63]: array([[ 0.31024139,  0.68975861]])
    In [64]: p = (3/4)*((3/7)**3)*(1/14)*(1/14)
    In [65]: p
    Out[65]: 0.00030121377997263036
    In [66]: p0 = (1/4)*((2/9)**3)*(2/9)*(2/9)
    In [67]: p0
    Out[67]: 0.00013548070246744223
    # normalised values
    In [68]: p/(p0+p)
    Out[68]: 0.6897586117634674
    In [69]: p0/(p0+p)
    Out[69]: 0.3102413882365326
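
To see the paper's unnormalised numbers directly from the fitted model, one option is to rebuild the joint score from the public attributes class_log_prior_ and feature_log_prob_. The following is a sketch under that assumption, not an official recipe; the variable names are illustrative, and the columns follow the order of clf.classes_.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "Chinese Beijing Chinese",
    "Chinese Chinese Shanghai",
    "Chinese Macao",
    "Tokyo Japan Chinese",
]
train_labels = ["yes", "yes", "yes", "no"]
test_texts = ["Chinese Chinese Chinese Tokyo Japan"]

vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = MultinomialNB().fit(X_train, train_labels)

# Joint log score per class: log P(c) + sum over words of count * log P(w | c).
joint_log = X_test.toarray() @ clf.feature_log_prob_.T + clf.class_log_prior_
joint = np.exp(joint_log)

# Columns are ordered like clf.classes_, i.e. 'no' first, then 'yes'.
for label, score in zip(clf.classes_, joint[0]):
    print(f"{label}: {score:.6f}")
# no: 0.000135
# yes: 0.000301

# Normalising the joint scores reproduces predict_proba.
normalised = joint / joint.sum(axis=1, keepdims=True)
print(np.allclose(normalised, clf.predict_proba(X_test)))  # True
```

The unnormalised scores are P(c) times the product of the word likelihoods, which is what the paper reports; predict_proba divides each of them by their sum.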