
How to change smoothing method of Naive Bayes classifier in NLTK?


I have trained a spam classifier using NLTK's Naive Bayes classifier. The spam and non-spam training sets each contain 20,000 word instances.

I have noticed that when the classifier encounters an unknown feature, it assigns it a spam probability of 0.5:

>>> print(classifier.prob_classify({'unknown_words': True}).prob('spam'))
0.5

I understand that this is called Laplace smoothing in Bayes classification. However, I want to set the spam probability of unknown features to 0.4, because unknown features are more likely to come from normal users. How can I implement this with NLTK?


Solution

  • I've found a really simple way to solve this problem.

    I selected 12,000 spam accounts and 18,000 normal accounts to re-train the Naive Bayes classifier, so the ratio of spam accounts to normal accounts is 0.4 / 0.6.

    So when the classifier receives a feature that was not in the training set, it gives a spam probability of about 0.4:

    In [23]: classifier.prob_classify({'unknown_words': True}).prob('spam')
    Out[23]: 0.40000333322222587
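
The re-training trick appears to work because of how the class prior is estimated. The following is a minimal sketch of that arithmetic in plain Python, not NLTK's own code. It assumes two things about NLTK's behaviour: that `NaiveBayesClassifier.train` uses its default `ELEProbDist` estimator (which adds 0.5 to every label count), and that a feature never seen in training is dropped, so the prediction falls back to the label prior. The helper name `ele_prior` and the label `'normal'` are hypothetical, chosen only for illustration.

```python
def ele_prior(label_counts, gamma=0.5):
    """Smoothed prior for each label: (count + gamma) / (total + gamma * bins).

    gamma=0.5 mirrors the expected-likelihood estimate that NLTK is
    assumed to use by default; this is a sketch, not NLTK's implementation.
    """
    total = sum(label_counts.values())
    bins = len(label_counts)
    return {
        label: (count + gamma) / (total + gamma * bins)
        for label, count in label_counts.items()
    }


# Balanced training set, assumed here to be 20,000 labeled examples per class.
balanced = ele_prior({"spam": 20000, "normal": 20000})

# Re-weighted training set from the solution: 12,000 spam, 18,000 normal.
reweighted = ele_prior({"spam": 12000, "normal": 18000})

print(balanced["spam"])               # 0.5
print(round(reweighted["spam"], 6))   # 0.400003
```

Under these assumptions the prior is (12000 + 0.5) / (30000 + 1), which would explain why the reported value is slightly above 0.4 rather than exactly 0.4. It also suggests that the class ratio of the training set, rather than the smoothing method, is what controls the probability returned for unseen features.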