Tags: classification, probability, data-mining, naivebayes, probability-distribution

How to prevent underflow when calculating probabilities with the Naïve Bayes Classifier algorithm?


I'm working on a Naïve Bayes Classifier algorithm for my data-mining course; however, I'm having an underflow problem when calculating the probabilities. The particular data set has ~305 attributes, so as you can imagine, the final probability will be very low. How can I avoid this problem?


Solution

  • One way to go is to work with the logarithms of the probabilities rather than the probabilities themselves. The idea is that you never calculate with raw probabilities, for fear you'll underflow to 0.0, but instead calculate with log-probabilities.

    Most of the changes are easy: e.g. instead of multiplying probabilities, add their logarithms, and for many distributions (e.g. Gaussians) it's easy to compute the log-probability directly rather than the probability.

    The only slightly tricky bit is if you need to add up probabilities while staying in log space. But this is a well-known problem, and searching for logsumexp turns up plenty of resources. There is a logsumexp function in scipy.special.
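As a sketch of the idea above (the numbers are made up for illustration; a real classifier would use per-attribute likelihoods estimated from training data), the product of ~305 small probabilities underflows, while the sum of their logs is perfectly representable, and the log-sum-exp trick handles the one place where you still need to add probabilities, e.g. when normalizing over classes:

```python
import math

# Hypothetical per-attribute likelihoods for one class: 305 small factors.
per_attr_probs = [1e-3] * 305

# Multiplying them directly underflows double precision (min ~1e-308).
naive = 1.0
for p in per_attr_probs:
    naive *= p
print(naive)  # 0.0 -- underflow

# Summing log-probabilities instead stays well within range.
log_score = sum(math.log(p) for p in per_attr_probs)
print(log_score)  # about -2106.9

# To add probabilities in log space (e.g. to normalize over classes),
# shift by the maximum before exponentiating so nothing underflows.
# scipy.special.logsumexp does the same thing.
def logsumexp(log_vals):
    m = max(log_vals)
    return m + math.log(sum(math.exp(v - m) for v in log_vals))

class_log_scores = [-2107.0, -2110.5]  # hypothetical per-class log scores
log_norm = logsumexp(class_log_scores)
posteriors = [math.exp(s - log_norm) for s in class_log_scores]
print(posteriors)  # normalized posteriors summing to 1.0
```

Note that comparing classes never requires leaving log space at all: the class with the largest log-score wins, so the final exponentiation is only needed if you want actual posterior probabilities.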