Tags: python, algorithm, nlp, text-classification, naive-bayes

Implementing Naive Bayes text categorization but I keep getting zeros


I am using Naive Bayes for text categorization. This is how I created the initial weights for each term in a given category (sketched in code after this list):

  • term1: (number of times term1 occurs) / (number of documents in categoryA)
  • term2: (number of times term2 occurs) / (number of documents in categoryA)
  • term3: (number of times term3 occurs) / (number of documents in categoryA)

  • term1: (number of times term1 occurs) / (number of documents in categoryB)
  • term2: (number of times term2 occurs) / (number of documents in categoryB)
  • term3: (number of times term3 occurs) / (number of documents in categoryB)
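
In code, my weight computation looks roughly like this (a simplified sketch; category_a_docs and category_b_docs stand in for my tokenized training documents):

    def train_weights(documents):
        """documents: list of token lists, all from one category."""
        n_docs = len(documents)
        doc_freq = {}
        for doc in documents:
            for term in set(doc):  # count each term at most once per document
                doc_freq[term] = doc_freq.get(term, 0) + 1
        # weight = documents containing the term / documents in the category,
        # so every weight stays between 0 and 1 and (1 - weight) is meaningful
        return {term: count / n_docs for term, count in doc_freq.items()}

    weights_a = train_weights(category_a_docs)
    weights_b = train_weights(category_b_docs)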

With a new test document, I adjust the weights based on whether each term exists in the test document:

  • term1: exists in the test document, so I use the same weight for categoryA_term1 as above
  • term2: does NOT exist in the test document, so I use (1 - weight) for categoryA_term2
  • term3: does NOT exist in the test document, so I use (1 - weight) for categoryA_term3

  • term1: exists in the test document, so I use the same weight for categoryB_term1 as above
  • term2: does NOT exist in the test document, so I use (1 - weight) for categoryB_term2
  • term3: does NOT exist in the test document, so I use (1 - weight) for categoryB_term3

Then I multiply the weights together for each category. This works when I create dummy train/test documents of one sentence each, but with real train/test documents I keep getting zero when I multiply everything together. Is this because the probabilities are so small that, after multiplying so many small numbers, Python just converges to zero? I am so stuck on this zero issue :( I would really appreciate your help!
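
The scoring step I described is essentially this (continuing the sketch above; test_document stands in for my tokenized test document):

    def score(weights, test_terms):
        """Multiply the per-term weights for one category, as described above."""
        p = 1.0
        for term, w in weights.items():
            p *= w if term in test_terms else (1.0 - w)
        return p

    test_terms = set(test_document)
    score_a = score(weights_a, test_terms)  # with real documents this comes out 0.0:
    score_b = score(weights_b, test_terms)  # hundreds of factors below 1 underflow a float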


Solution

  • As Ed Cottrell commented, you need to consider what happens when you encounter a word that appears in no document of a category. You can avoid multiplying by 0 by using Laplace smoothing: if you see a word in k out of n documents in a category, you assign it the conditional probability (k+1)/(n+2), or more generally (k+a)/(n+2a) for a smoothing parameter a, given the category.
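    A minimal sketch of that smoothed estimate in Python (the function name is illustrative, not from any library):

    def smoothed_weight(k, n, a=1.0):
        """Laplace/Lidstone-smoothed P(term | category).

        k: number of documents in the category containing the term
        n: total number of documents in the category
        a: smoothing parameter (a=1 gives classic Laplace smoothing)
        """
        return (k + a) / (n + 2 * a)

    smoothed_weight(0, 50)   # ~0.0192: never exactly 0, even for unseen terms
    smoothed_weight(50, 50)  # ~0.9808: never exactly 1 either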

    Instead of taking a product of many small numbers, it is standard to compute the logarithm of the product.

    log(x*y) = log(x) + log(y)

    log(P(a0|c) * P(a1|c) * ... * P(ak|c))
        = log P(a0|c) + log P(a1|c) + ... + log P(ak|c)

    Then you have a sum of numbers that are not so small. Avoid taking the log of 0; with smoothing in place, no conditional probability is exactly 0, so every term of the sum is finite. You can exponentiate afterwards if necessary, but usually you just translate your decision threshold into a condition on the logarithm.
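
    Putting both fixes together, the scoring might look like this in Python (a sketch under the assumptions above; weights_a, weights_b, and test_terms are placeholders for your own data):

    import math

    def log_score(weights, test_terms):
        """Sum log-probabilities instead of multiplying probabilities.

        weights: {term: smoothed P(term | category)}, all strictly between 0 and 1
        test_terms: set of terms that appear in the test document
        """
        total = 0.0
        for term, w in weights.items():
            # log P(term present) if the term is in the document,
            # log P(term absent) otherwise; smoothing keeps both finite
            total += math.log(w if term in test_terms else 1.0 - w)
        return total

    # Compare log-scores directly; no need to exponentiate back
    label = 'A' if log_score(weights_a, test_terms) > log_score(weights_b, test_terms) else 'B'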