python scikit-learn text-classification naivebayes

Is there any way to extract the maximum a posteriori probabilities from scikit-learn's Multinomial Naive Bayes, as in the Stanford NLP research paper?


I'm trying to replicate the results of the paper at this link:

https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

This link explains how Multinomial Naive Bayes works for text classification.

I've tried to reproduce the example using scikit-learn.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

#TRAINING SET
dftrain = pd.DataFrame(data=np.array([["Chinese Beijing Chinese", "Chinese Chinese Shanghai", "Chinese Macao", "Tokyo Japan Chinese"], 
["yes", "yes", "yes", "no"]]))

dftrain = dftrain.T
dftrain.columns = ['text', 'label']

#TEST SET
dftest = pd.DataFrame(data=np.array([["Chinese Chinese Chinese Tokyo Japan"]]))
dftest.columns = ['text']

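# min_df=1 (the default) keeps every term; newer scikit-learn releases may reject an integer min_df of 0
count_vectorizer = CountVectorizer(min_df=1, token_pattern=r"\b\w+\b", stop_words=None)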
count_train = count_vectorizer.fit_transform(dftrain['text'])
count_test = count_vectorizer.transform(dftest['text'])

clf = MultinomialNB()
clf.fit(count_train, dftrain['label'])
clf.predict(count_test)

The output is correctly printed as:

array(['yes'],
  dtype='<U3')

Just as it's mentioned in the paper! The paper predicts yes because

P(yes | test set) = 0.0003 > P(no | test set) = 0.0001

I want to be able to see those two probabilities!

When I type:

clf.predict_proba(count_test)

I get

array([[ 0.31024139,  0.68975861]])

I think what this means is:

P(test belongs to label 'no') = 0.31024139 and P(test belongs to label 'yes') = 0.68975861

Therefore, scikit-learn predicts that the text belongs to the label yes.

My question is: why are the probabilities different? The paper gives P(yes | test set) = 0.0003 > P(no | test set) = 0.0001, but I don't see the numbers 0.0003 and 0.0001; instead I see 0.31024139 and 0.68975861.

Am I missing something here? Does this have something to do with the class_prior parameter?

I did read the documentation!

http://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes

Apparently, the parameter is estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting.

What I'm wondering is: is there any way I can replicate and see the same results as the ones in the research paper?


Solution

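  • This has more to do with the meaning of the probabilities that predict_proba produces. The numbers 0.0003 and 0.0001 are not normalised, i.e. they don't sum to one. They are the class prior multiplied by the smoothed term likelihoods, which is only proportional to the posterior. If you normalise these values, you get the same result as predict_proba.

    See the snippet below: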

    In [63]: clf.predict_proba(count_test)
    Out[63]: array([[ 0.31024139,  0.68975861]])
    
    In [64]: p = (3/4)*((3/7)**3)*(1/14)*(1/14)
    
    In [65]: p
    Out[65]: 0.00030121377997263036
    
    In [66]: p0 = (1/4)*((2/9)**3)*(2/9)*(2/9)
    
    In [67]: p0
    Out[67]: 0.00013548070246744223
    
    #normalised values
    In [68]: p/(p0+p)
    Out[68]: 0.6897586117634674
    
    In [69]: p0/(p0+p)
    Out[69]: 0.3102413882365326
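
The hand calculation above can also be reproduced from the fitted model itself. The following is a minimal sketch, assuming the default MultinomialNB settings (alpha=1.0, priors learned from the data) and the same toy corpus as in the question; the variable names (train_texts, joint, normalised) are made up for illustration. It combines the fitted attributes class_log_prior_ and feature_log_prob_ to get the unnormalised values, then normalises them for comparison with predict_proba.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus from the question (illustrative variable names)
train_texts = [
    "Chinese Beijing Chinese",
    "Chinese Chinese Shanghai",
    "Chinese Macao",
    "Tokyo Japan Chinese",
]
train_labels = ["yes", "yes", "yes", "no"]
test_texts = ["Chinese Chinese Chinese Tokyo Japan"]

vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = MultinomialNB()  # alpha=1.0 by default, i.e. add-one smoothing
clf.fit(X_train, train_labels)

# Unnormalised log score per class:
#   log P(c) + sum over terms of count(t) * log P(t | c)
# Columns follow the order of clf.classes_.
joint_log = X_test.toarray() @ clf.feature_log_prob_.T + clf.class_log_prior_
joint = np.exp(joint_log)

for label, value in zip(clf.classes_, joint[0]):
    print(label, value)

# Dividing by the row sum gives the values that predict_proba reports
normalised = joint / joint.sum(axis=1, keepdims=True)
print(normalised)
```

On this corpus the printed scores should come out at roughly 0.000135 for 'no' and 0.000301 for 'yes', matching p0 and p above, and the normalised row should agree with clf.predict_proba(X_test).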