Tags: python, algorithm, nlp, text-classification, naive-bayes

Implementing Naive Bayes text categorization but I keep getting zeros


I am using Naive Bayes for text categorization. This is how I created the initial weights for each term in a given category (sketched in code after this list):

  • term1: (number of times term1 occurs) / (number of documents in categoryA)
  • term2: (number of times term2 occurs) / (number of documents in categoryA)
  • term3: (number of times term3 occurs) / (number of documents in categoryA)

  • term1: (number of times term1 occurs) / (number of documents in categoryB)
  • term2: (number of times term2 occurs) / (number of documents in categoryB)
  • term3: (number of times term3 occurs) / (number of documents in categoryB)
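
In code, my weight computation looks roughly like this (a simplified sketch; category_a_docs and category_b_docs stand in for my tokenized training documents):

    def train_weights(documents):
        """documents: list of token lists, all from one category."""
        n_docs = len(documents)
        doc_freq = {}
        for doc in documents:
            for term in set(doc):  # count each term at most once per document
                doc_freq[term] = doc_freq.get(term, 0) + 1
        # weight = documents containing the term / documents in the category,
        # so every weight stays between 0 and 1 and (1 - weight) is meaningful
        return {term: count / n_docs for term, count in doc_freq.items()}

    weights_a = train_weights(category_a_docs)
    weights_b = train_weights(category_b_docs)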

With a new test document, I adjust the weights based on whether each term exists in the test document:

  • term1: exists in the test document, so I use the same weight for categoryA_term1 as above
  • term2: does NOT exist in the test document, so I use (1 - weight) for categoryA_term2
  • term3: does NOT exist in the test document, so I use (1 - weight) for categoryA_term3

  • term1: exists in the test document, so I use the same weight for categoryB_term1 as above
  • term2: does NOT exist in the test document, so I use (1 - weight) for categoryB_term2
  • term3: does NOT exist in the test document, so I use (1 - weight) for categoryB_term3

Then I multiply the weights together for each category. This works when I create dummy train/test documents of one sentence each, but with real train/test documents I keep getting zero when I multiply everything together. Is this because the probabilities are so small that, after multiplying so many small numbers, Python just converges to zero? I am so stuck on this zero issue :( I would really appreciate your help!
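
The scoring step I described is essentially this (continuing the sketch above; test_document stands in for my tokenized test document):

    def score(weights, test_terms):
        """Multiply the per-term weights for one category, as described above."""
        p = 1.0
        for term, w in weights.items():
            p *= w if term in test_terms else (1.0 - w)
        return p

    test_terms = set(test_document)
    score_a = score(weights_a, test_terms)  # with real documents this comes out 0.0:
    score_b = score(weights_b, test_terms)  # hundreds of factors below 1 underflow a float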


Solution

  • As Ed Cottrell commented, you need to consider what happens when you encounter a word that appears in no document of a category. You can avoid multiplying by 0 by using Laplace smoothing: if you see a word in k out of n documents in a category, you assign it the conditional probability (k+1)/(n+2), or more generally (k+a)/(n+2a) for a smoothing parameter a, given the category.
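    A minimal sketch of that smoothed estimate in Python (the function name is illustrative, not from any library):

    def smoothed_weight(k, n, a=1.0):
        """Laplace/Lidstone-smoothed P(term | category).

        k: number of documents in the category containing the term
        n: total number of documents in the category
        a: smoothing parameter (a=1 gives classic Laplace smoothing)
        """
        return (k + a) / (n + 2 * a)

    smoothed_weight(0, 50)   # ~0.0192: never exactly 0, even for unseen terms
    smoothed_weight(50, 50)  # ~0.9808: never exactly 1 either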

    Instead of taking a product of many small numbers, it is standard to compute the logarithm of the product.

    log(x*y) = log(x) + log(y)

    log(P(a0|c) * P(a1|c) * ... * P(ak|c))
        = log P(a0|c) + log P(a1|c) + ... + log P(ak|c)

    Then you have a sum of numbers that are not so small. Avoid taking the log of 0; with smoothing in place, no conditional probability is exactly 0, so every term of the sum is finite. You can exponentiate afterwards if necessary, but usually you just translate your decision threshold into a condition on the logarithm.
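
    Putting both fixes together, the scoring might look like this in Python (a sketch under the assumptions above; weights_a, weights_b, and test_terms are placeholders for your own data):

    import math

    def log_score(weights, test_terms):
        """Sum log-probabilities instead of multiplying probabilities.

        weights: {term: smoothed P(term | category)}, all strictly between 0 and 1
        test_terms: set of terms that appear in the test document
        """
        total = 0.0
        for term, w in weights.items():
            # log P(term present) if the term is in the document,
            # log P(term absent) otherwise; smoothing keeps both finite
            total += math.log(w if term in test_terms else 1.0 - w)
        return total

    # Compare log-scores directly; no need to exponentiate back
    label = 'A' if log_score(weights_a, test_terms) > log_score(weights_b, test_terms) else 'B'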