Naive Bayes classifier from scratch in Python?

I wrote a simple Naive Bayes classifier for my toy dataset:

                 msg  spam
0  free home service     1
1      get free data     1
2  we live in a home     0
3    i drive the car     0

Full code

import pandas as pd
from collections import Counter

data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)

def word_counter(word_list):
    # flatten the list of tokenized messages into one list of words
    words = []
    for x in word_list:
        for i in x:
            words.append(i)
    word_count = Counter(words)
    return word_count

spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam = word_counter(spam)

ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham = word_counter(ham)

total = len(spam.keys())+len(ham.keys())

# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)

new_data = ["get free home service","i live in car"]
for msg in new_data:
    data = msg.split()
    # Likelihood
    spam_likelihood = 0.001 # low value to prevent divisional error
    ham_likelihood = 0.001
    for i in data:
        if i in spam:
            if spam_likelihood==0.001:
                spam_likelihood = spam[i]/total
            spam_likelihood = spam[i]/total * spam_likelihood
        if i in ham:
            if ham_likelihood==0.001:
                ham_likelihood = ham[i]/total
            ham_likelihood = ham[i]/total * ham_likelihood
    # marginal likelihood
    marginal = (spam_likelihood*spam_prior) + (ham_likelihood*ham_prior)
    spam_posterior = (spam_likelihood*spam_prior)/marginal

The problem is that it fails completely when scoring the spamminess of unseen data:

get free home service 0.07
i live in car 97.46

I expected a high value for "get free home service" and a low value for "i live in car".

My question is: is this error due to a lack of additional data, or is it caused by a coding error?


  • The problem is with the code. The likelihood is computed incorrectly. See Wikipedia:Naive_Bayes_classifier for the right formula for the likelihood under the bag-of-words model.

    Your code works as if the likelihood p(word | spam) is 1 when the word wasn't previously encountered in spam. With Laplace smoothing, it should be 1 / (spam_total + 1), where spam_total is the total number of words in spam (with repetition).

    When the word was previously encountered in spam x times, it should be (x + 1) / (spam_total + 1).
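
The two cases above collapse into a single expression, because an unseen word simply has a count of 0. A minimal sketch follows; `smoothed_likelihood` is a hypothetical helper name, not part of the code in this answer. Note that the `+ 1` in the denominator is the simplification used here: textbook add-one (Laplace) smoothing adds the number of distinct words in the vocabulary instead, which the optional `vocab_size` argument (also an assumption added for illustration) makes possible.

```python
def smoothed_likelihood(count, class_total, vocab_size=1):
    """Estimate p(word | class) with add-one smoothing.

    count       -- times the word appeared in the class's training messages
    class_total -- total number of words in that class (with repetition)
    vocab_size  -- value added to the denominator; 1 reproduces the formula
                   from this answer, the number of distinct words would give
                   textbook Laplace smoothing (hypothetical parameter)
    """
    return (count + 1) / (class_total + vocab_size)


# The two spam messages in the toy dataset contain 6 words in total.
spam_total = 6
unseen = smoothed_likelihood(0, spam_total)      # word never seen in spam: 1/7
seen_twice = smoothed_likelihood(2, spam_total)  # "free" appears twice: 3/7
print(unseen, seen_twice)
```

With the default `vocab_size=1` the estimates for one class do not sum to 1 over all words, which is harmless for this toy comparison but worth keeping in mind on real data.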

    I've changed the Counter to defaultdict to conveniently deal with words that weren't encountered before, fixed the likelihood calculation and added Laplace smoothing:

    import pandas as pd
    from collections import defaultdict
    data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
    df = pd.DataFrame(data=data)
    def word_counter(sentence_list):
        word_count = defaultdict(int)  # missing words default to a count of 0
        for sentence in sentence_list:
            for word in sentence:
                word_count[word] += 1
        return word_count
    spam = [x.split() for x in set(df['msg'][df['spam']==1])]
    spam_total = sum([len(sentence) for sentence in spam])
    spam = word_counter(spam)
    ham = [x.split() for x in set(df['msg'][df['spam']==0])]
    ham_total = sum([len(sentence) for sentence in ham])
    ham = word_counter(ham)
    # Prior
    spam_prior = len(df['spam'][df['spam']==1])/len(df)
    ham_prior = len(df['spam'][df['spam']==0])/len(df)
    new_data = ["get free home service","i live in car"]
    for msg in new_data:
        data = msg.split()
        # Likelihood
        spam_likelihood = 1
        ham_likelihood = 1
        for word in data:
            spam_likelihood *= (spam[word] + 1) / (spam_total + 1)
            ham_likelihood *= (ham[word] + 1) / (ham_total + 1)
        # marginal likelihood
        marginal = (spam_likelihood * spam_prior) + (ham_likelihood * ham_prior)
        spam_posterior = (spam_likelihood * spam_prior) / marginal
        # report the spam posterior as a percentage
        print(msg, round(spam_posterior * 100, 2))

    Now the results are as expected:

    get free home service 98.04
    i live in car 20.65

    This can be improved further. For example, for numerical stability the product of all these probabilities should be replaced by a sum of their logarithms.
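
The log-space variant can be sketched as follows. It keeps the same `+ 1` smoothing and the same toy data, but sums logarithms and only converts back to a probability at the very end. The function name `spam_posterior_log` is hypothetical, and pandas is left out so that the sketch stays self-contained; treat it as one possible arrangement under those assumptions rather than a definitive implementation.

```python
import math
from collections import Counter

messages = ['free home service', 'get free data', 'we live in a home', 'i drive the car']
labels = [1, 1, 0, 0]

# Per-class word counts; Counter returns 0 for words it has never seen.
spam = Counter(w for m, y in zip(messages, labels) if y == 1 for w in m.split())
ham = Counter(w for m, y in zip(messages, labels) if y == 0 for w in m.split())
spam_total = sum(spam.values())
ham_total = sum(ham.values())
spam_prior = labels.count(1) / len(labels)
ham_prior = labels.count(0) / len(labels)


def spam_posterior_log(msg):
    """Return p(spam | msg), accumulating log-probabilities instead of products."""
    log_spam = math.log(spam_prior)
    log_ham = math.log(ham_prior)
    for word in msg.split():
        log_spam += math.log((spam[word] + 1) / (spam_total + 1))
        log_ham += math.log((ham[word] + 1) / (ham_total + 1))
    # p(spam | msg) = 1 / (1 + exp(log_ham - log_spam)); branch on the sign
    # so that math.exp is only ever called with a non-positive argument.
    diff = log_ham - log_spam
    if diff > 0:
        e = math.exp(-diff)
        return e / (1 + e)
    return 1 / (1 + math.exp(diff))


for msg in ["get free home service", "i live in car"]:
    print(msg, round(spam_posterior_log(msg) * 100, 2))
```

On the toy dataset this should reproduce the percentages reported above (98.04 and 20.65). The difference only shows on long messages, where the plain product of many small probabilities underflows to 0.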