Tags: python, text-classification, naive-bayes

Difficulty getting the correct posterior value in a Naive Bayes implementation


For study purposes, I've tried to implement this lesson in Python, but without scikit-learn or anything similar.

My attempt is the following code:

import pandas, math

training_data = [
        ['A great game','Sports'],
        ['The election was over','Not sports'],
        ['Very clean match','Sports'],
        ['A clean but forgettable game','Sports'],
        ['It was a close election','Not sports']
]

text_to_predict = 'A very close game'
data_frame = pandas.DataFrame(training_data, columns=['data','label'])
data_frame = data_frame.applymap(lambda s:s.lower() if type(s) == str else s)
text_to_predict = text_to_predict.lower()
labels = data_frame.label.unique()
word_frequency = data_frame.data.str.split(expand=True).stack().value_counts()
unique_words_set = set()
unique_words = data_frame.data.str.split().apply(unique_words_set.update)
total_unique_words = len(unique_words_set)

word_frequency_per_labels = []
for l in labels:
    word_frequency_per_label = data_frame[data_frame.label == l].data.str.split(expand=True).stack().value_counts()
    for w, f in word_frequency_per_label.items():  # Series.iteritems() was removed in pandas 2.0
        word_frequency_per_labels.append([w,f,l])

word_frequency_per_labels_df = pandas.DataFrame(word_frequency_per_labels, columns=['word','frequency','label'])
laplace_smoothing = 1
results = []
for l in labels:
    p = []
    total_words_in_label = word_frequency_per_labels_df[word_frequency_per_labels_df.label == l].frequency.sum()
    for w in text_to_predict.split():
        # frequency of word w under label l (0 if the word never appears with that label)
        x = (word_frequency_per_labels_df.query('word == @w and label == @l').frequency.to_list()[:1] or [0])[0]
        p.append((x + laplace_smoothing) / (total_words_in_label + total_unique_words))
    results.append([l,math.prod(p)])

print(results)
result = pandas.DataFrame(results, columns=['labels','posterior']).sort_values('posterior',ascending = False).labels.iloc[0]
print(result)

In the blog lesson, their results are:

[image: the blog's posterior values]

But my results were:

[['sports', 4.607999999999999e-05], ['not sports', 1.4293831139825827e-05]]

So, what did I do wrong in my Python implementation? How can I get the same results?

Thanks in advance


Solution

  • You haven't multiplied by the priors p(Sports) = 3/5 and p(Not sports) = 2/5. Simply scaling your answers by these ratios gives the correct result; everything else looks good.

    For example, your math.prod(p) calculation computes p(a|Sports) x p(very|Sports) x p(close|Sports) x p(game|Sports), but this ignores the term p(Sports). Adding it in (and doing the same for the Not sports label) fixes things.

    In code this can be achieved by:

    prior = (data_frame.label == l).mean()  # fraction of training rows with label l, i.e. p(l)
    results.append([l, prior * math.prod(p)])
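
    Putting the fix together, here's a self-contained sketch of the corrected pipeline. It's rewritten with plain-Python counters instead of pandas (so the helper name `posteriors` and its structure are mine, not the question's), but it computes the same Laplace-smoothed likelihoods, now multiplied by the prior:

    ```python
    import math
    from collections import Counter

    training_data = [
        ('a great game', 'Sports'),
        ('the election was over', 'Not sports'),
        ('very clean match', 'Sports'),
        ('a clean but forgettable game', 'Sports'),
        ('it was a close election', 'Not sports'),
    ]

    def posteriors(training_data, text, alpha=1):
        """Unnormalised Naive Bayes posteriors with Laplace smoothing."""
        docs_per_label = Counter(label for _, label in training_data)
        n_docs = sum(docs_per_label.values())
        # vocabulary size is shared across labels (same as total_unique_words)
        vocabulary = {w for doc, _ in training_data for w in doc.split()}
        word_counts = {label: Counter() for label in docs_per_label}
        for doc, label in training_data:
            word_counts[label].update(doc.split())

        scores = {}
        for label, n in docs_per_label.items():
            prior = n / n_docs  # p(label) -- the factor the question was missing
            total = sum(word_counts[label].values())
            likelihoods = [
                (word_counts[label][w] + alpha) / (total + len(vocabulary))
                for w in text.split()
            ]
            scores[label] = prior * math.prod(likelihoods)
        return scores

    post = posteriors(training_data, 'a very close game')
    print(post)
    ```

    The word-likelihood products come out to the same 4.608e-05 and 1.4294e-05 you printed; multiplying by the priors 3/5 and 2/5 turns them into the posteriors, and Sports still wins.
    
    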