I wrote a simple naive Bayes classifier for my toy dataset:
                 msg  spam
0  free home service     1
1      get free data     1
2  we live in a home     0
3    i drive the car     0
Full code:
import pandas as pd
from collections import Counter

data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)
print(df)

def word_counter(word_list):
    words = []
    for x in word_list:
        for i in x:
            words.append(i)
    word_count = Counter(words)
    return word_count

spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam = word_counter(spam)
ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham = word_counter(ham)
total = len(spam.keys())+len(ham.keys())

# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)

new_data = ["get free home service","i live in car"]
print("\n\tSpamminess")
for msg in new_data:
    data = msg.split()
    # Likelihood
    spam_likelihood = 0.001 # low value to prevent divisional error
    ham_likelihood = 0.001
    for i in data:
        if i in spam:
            if spam_likelihood==0.001:
                spam_likelihood = spam[i]/total
                continue
            spam_likelihood = spam[i]/total * spam_likelihood
        if i in ham:
            if ham_likelihood==0.001:
                ham_likelihood = ham[i]/total
                continue
            ham_likelihood = ham[i]/total * ham_likelihood
    # marginal likelihood
    marginal = (spam_likelihood*spam_prior) + (ham_likelihood*ham_prior)
    spam_posterior = (spam_likelihood*spam_prior)/marginal
    print(msg,round(spam_posterior*100,2))
The problem is that it fails completely at classifying the spamminess of unseen data:
get free home service 0.07
i live in car 97.46
I expected a high value for "get free home service" and a low value for "i live in car".
My question is whether this error is due to a lack of data or whether it is because of a coding error.
The problem is with the code. The likelihood is computed incorrectly. See Wikipedia:Naive_Bayes_classifier for the right formula for the likelihood under the bag-of-words model.
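Under that model, the likelihood of a message is the product of the per-word likelihoods:

p(msg | spam) = p(w_1 | spam) * p(w_2 | spam) * ... * p(w_n | spam)

and similarly for ham.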
Your code works as if the likelihood p(word | spam) is 1 when the word wasn't previously encountered in spam. With Laplace smoothing, it should be 1 / (spam_total + 1), where spam_total is the total number of words in spam (counted with repetition).
When the word was previously encountered in spam x times, it should be (x + 1) / (spam_total + 1).
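For example, with your toy data the spam messages contain spam_total = 6 words ('free' appears twice, every other spam word once), so for "get free home service" the smoothed likelihood is

p(get | spam)     = (1 + 1) / (6 + 1) = 2/7
p(free | spam)    = (2 + 1) / (6 + 1) = 3/7
p(home | spam)    = (1 + 1) / (6 + 1) = 2/7
p(service | spam) = (1 + 1) / (6 + 1) = 2/7
spam_likelihood   = 2/7 * 3/7 * 2/7 * 2/7 ≈ 0.0100

while on the ham side (ham_total = 9) the words 'get', 'free' and 'service' are unseen and each contributes only 1/10, so the posterior comes out heavily in favour of spam.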
I've changed the Counter to a defaultdict to conveniently handle words that weren't encountered before, fixed the likelihood calculation, and added Laplace smoothing:
import pandas as pd
from collections import defaultdict

data = {'msg':['free home service','get free data','we live in a home','i drive the car'],'spam':[1,1,0,0]}
df = pd.DataFrame(data=data)
print(df)

def word_counter(sentence_list):
    word_count = defaultdict(lambda: 0)
    for sentence in sentence_list:
        for word in sentence:
            word_count[word] += 1
    return word_count

spam = [x.split() for x in set(df['msg'][df['spam']==1])]
spam_total = sum([len(sentence) for sentence in spam])
spam = word_counter(spam)
ham = [x.split() for x in set(df['msg'][df['spam']==0])]
ham_total = sum([len(sentence) for sentence in ham])
ham = word_counter(ham)

# Prior
spam_prior = len(df['spam'][df['spam']==1])/len(df)
ham_prior = len(df['spam'][df['spam']==0])/len(df)

new_data = ["get free home service","i live in car"]
print("\n\tSpamminess")
for msg in new_data:
    data = msg.split()
    # Likelihood (Laplace-smoothed)
    spam_likelihood = 1
    ham_likelihood = 1
    for word in data:
        spam_likelihood *= (spam[word] + 1) / (spam_total + 1)
        ham_likelihood *= (ham[word] + 1) / (ham_total + 1)
    # marginal likelihood
    marginal = (spam_likelihood * spam_prior) + (ham_likelihood * ham_prior)
    spam_posterior = (spam_likelihood * spam_prior) / marginal
    print(msg, round(spam_posterior * 100, 2))
Now the results are as expected:
Spamminess
get free home service 98.04
i live in car 20.65
This can be improved further; e.g., for numerical stability, the multiplication of all these probabilities should be replaced by adding their logarithms.
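A minimal sketch of that change, reusing the variables from the code above (spam, ham, spam_total, ham_total and the priors) and Python's math.log:

import math

for msg in new_data:
    data = msg.split()
    # accumulate log-probabilities instead of multiplying raw probabilities
    spam_log = math.log(spam_prior)
    ham_log = math.log(ham_prior)
    for word in data:
        spam_log += math.log((spam[word] + 1) / (spam_total + 1))
        ham_log += math.log((ham[word] + 1) / (ham_total + 1))
    # normalize in log space so exp() neither under- nor overflows
    m = max(spam_log, ham_log)
    spam_posterior = math.exp(spam_log - m) / (math.exp(spam_log - m) + math.exp(ham_log - m))
    print(msg, round(spam_posterior * 100, 2))

On this toy data it prints the same posteriors as above, but it stays stable for much longer messages, where the raw product would underflow to 0.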