I have trained a spam classifier with NLTK's Naive Bayes classifier. Both the spam set and the non-spam set have 20,000 word instances in training.
I have noticed that when it encounters an unknown feature, the classifier gives it a 0.5 probability of spam:
>>> print classifier.prob_classify({'unknown_words': True}).prob('spam')
0.5
I assume this comes from the smoothing used in Bayes classification (e.g. Laplace smoothing). However, I want to set the spam probability of unknown features to 0.4, because unknown features are more likely to come from normal users. How can I implement this with NLTK?
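For context, here is a minimal, self-contained sketch (synthetic data, made-up feature names) that reproduces the 0.5 above. NLTK's NaiveBayesClassifier discards feature names it never saw during training, so a featureset containing only unknown features falls back to the label prior, which is exactly 0.5 when both classes have equal training counts:

from nltk.classify import NaiveBayesClassifier

# Balanced synthetic training data: 20,000 featuresets per class.
train_set = ([({'buy': True, 'cheap': True}, 'spam')] * 20000 +
             [({'hello': True, 'meeting': True}, 'normal')] * 20000)
classifier = NaiveBayesClassifier.train(train_set)

# The unknown feature is discarded, so only the 50/50 prior remains.
print(classifier.prob_classify({'unknown_words': True}).prob('spam'))
# -> 0.5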
I've found a really simple way to solve this problem.
I selected 12,000 spam accounts and 18,000 normal accounts to retrain the Naive Bayes classifier, so the proportion of spam accounts to normal accounts is 0.4 : 0.6.
Now, when the classifier encounters a feature it never saw in training, it gives it a 0.4 probability of spam:
In [23]: classifier.prob_classify({'unknown_words': True}).prob('spam')
Out[23]: 0.40000333322222587
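Re-running the earlier sketch (again with synthetic stand-in featuresets) using the 12,000 : 18,000 split reproduces this number:

from nltk.classify import NaiveBayesClassifier

# Unbalanced synthetic training data encoding the 0.4 : 0.6 class prior.
train_set = ([({'buy': True, 'cheap': True}, 'spam')] * 12000 +
             [({'hello': True, 'meeting': True}, 'normal')] * 18000)
classifier = NaiveBayesClassifier.train(train_set)

# The unknown feature is discarded, so only the class prior remains.
print(classifier.prob_classify({'unknown_words': True}).prob('spam'))
# -> 0.40000333322222587, matching Out[23] above

The small deviation from 0.4 comes from NLTK's default smoothing of the label prior (expected likelihood estimation, which adds 0.5 to each label count): (12000 + 0.5) / (30000 + 1) ≈ 0.4000033.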