I wrote a simple naive Bayes classifier for my toy dataset:
                 msg  spam
0  free home service     1
1      get free data     1
2  we live in a home     0
3    i drive the car     0
Full code:
import pandas as pd
from collections import Counter

data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)
print(df)

def word_counter(word_list):
    words = []
    for x in word_list:
        for i in x:
            words.append(i)
    word_count = Counter(words)
    return word_count

spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam = word_counter(spam)
ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham = word_counter(ham)
total = len(spam.keys())+len(ham.keys())

# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)

new_data = ["get free home service","i live in car"]
print("\n\tSpamminess")
for msg in new_data:
    data = msg.split()
    # Likelihood
    spam_likelihood = 0.001 # low value to prevent divisional error
    ham_likelihood = 0.001
    for i in data:
        if i in spam:
            if spam_likelihood==0.001:
                spam_likelihood = spam[i]/total
                continue
            spam_likelihood = spam[i]/total * spam_likelihood
        if i in ham:
            if ham_likelihood==0.001:
                ham_likelihood = ham[i]/total
                continue
            ham_likelihood = ham[i]/total * ham_likelihood
    # marginal likelihood
    marginal = (spam_likelihood*spam_prior) + (ham_likelihood*ham_prior)
    spam_posterior = (spam_likelihood*spam_prior)/marginal
    print(msg,round(spam_posterior*100,2))
The problem is that it fails completely at classifying the spamminess of unseen data:
get free home service 0.07
i live in car 97.46
I expected a high value for "get free home service" and a low value for "i live in car".
My question is whether this error is due to a lack of data or whether it is because of a coding error.
The problem is with the code. The likelihood is computed incorrectly. See Wikipedia:Naive_Bayes_classifier for the right formula for the likelihood under the bag-of-words model.
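Under that model, the likelihood of a message is the product of the per-word likelihoods:

p(msg | spam) = p(w_1 | spam) * p(w_2 | spam) * ... * p(w_n | spam)

and similarly for ham.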
Your code works as if the likelihood p(word | spam) is 1 when the word wasn't previously encountered in spam. With Laplace smoothing, it should be 1 / (spam_total + 1), where spam_total is the total number of words in spam (counted with repetition).
When the word was previously encountered in spam x times, it should be (x + 1) / (spam_total + 1).
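For example, with your toy data the spam messages contain spam_total = 6 words ('free' appears twice, every other spam word once), so for "get free home service" the smoothed likelihood is

p(get | spam)     = (1 + 1) / (6 + 1) = 2/7
p(free | spam)    = (2 + 1) / (6 + 1) = 3/7
p(home | spam)    = (1 + 1) / (6 + 1) = 2/7
p(service | spam) = (1 + 1) / (6 + 1) = 2/7
spam_likelihood   = 2/7 * 3/7 * 2/7 * 2/7 ≈ 0.0100

while on the ham side (ham_total = 9) the words 'get', 'free' and 'service' are unseen and each contributes only 1/10, so the posterior comes out heavily in favour of spam.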
I've changed the Counter to a defaultdict to conveniently handle words that weren't encountered before, fixed the likelihood calculation, and added Laplace smoothing:
import pandas as pd
from collections import defaultdict

data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)
print(df)

def word_counter(sentence_list):
    word_count = defaultdict(lambda: 0)
    for sentence in sentence_list:
        for word in sentence:
            word_count[word] += 1
    return word_count

spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam_total = sum([len(sentence) for sentence in spam])
spam = word_counter(spam)
ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham_total = sum([len(sentence) for sentence in ham])
ham = word_counter(ham)

# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)

new_data = ["get free home service","i live in car"]
print("\n\tSpamminess")
for msg in new_data:
    data = msg.split()
    # Likelihood (Laplace-smoothed)
    spam_likelihood = 1
    ham_likelihood = 1
    for word in data:
        spam_likelihood *= (spam[word] + 1) / (spam_total + 1)
        ham_likelihood *= (ham[word] + 1) / (ham_total + 1)
    # marginal likelihood
    marginal = (spam_likelihood * spam_prior) + (ham_likelihood * ham_prior)
    spam_posterior = (spam_likelihood * spam_prior) / marginal
    print(msg, round(spam_posterior * 100, 2))
Now the results are as expected:
Spamminess
get free home service 98.04
i live in car 20.65
This can be improved further; e.g., for numerical stability, the multiplication of all these probabilities should be replaced by adding their logarithms.
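A minimal sketch of that change, reusing the variables from the code above (spam, ham, spam_total, ham_total and the priors) and Python's math.log:

import math

for msg in new_data:
    data = msg.split()
    # accumulate log-probabilities instead of multiplying raw probabilities
    spam_log = math.log(spam_prior)
    ham_log = math.log(ham_prior)
    for word in data:
        spam_log += math.log((spam[word] + 1) / (spam_total + 1))
        ham_log += math.log((ham[word] + 1) / (ham_total + 1))
    # normalize in log space so exp() neither under- nor overflows
    m = max(spam_log, ham_log)
    spam_posterior = math.exp(spam_log - m) / (math.exp(spam_log - m) + math.exp(ham_log - m))
    print(msg, round(spam_posterior * 100, 2))

On this toy data it prints the same posteriors as above, but it stays stable for much longer messages, where the raw product would underflow to 0.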