I've written a fairly efficient multinomial Naive Bayes classifier and it works like a charm, until the classifier encounters very long messages (on the order of 10k words): the prediction results become nonsensical (e.g. -0.0) and I get a math domain error from Python's math.log function. The reason I'm using log is that multiplying many very small floats yields ever smaller floats, and working in log space avoids the vanishingly small numbers that would break the predictions.
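For illustration, here is a minimal sketch (with a made-up per-word likelihood) of the kind of failure I mean, where the product underflows before the log is taken:
import math

p = 1e-5                    # made-up likelihood of a single word
product = 1.0
for _ in range(10_000):     # a very long message
    product *= p            # underflows to 0.0 long before the loop ends
print(product)              # 0.0
print(math.log(product))    # raises ValueError: math domain error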
Some context
I'm using the bag-of-words approach without any weighting such as TF-IDF (mainly because I can't figure out how to implement it properly and handle words with zero occurrences; a snippet for that would be appreciated too ;) ). I work with raw frequency counts plus Laplace additive smoothing (basically adding 1 to each frequency count so it's never 0).
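To be concrete, by Laplace smoothing I mean add-one counts, roughly like this (toy, made-up counts):
counts = {"free": 3, "win": 0}   # word counts within one class; "win" was never seen
vocab_size = len(counts)
total = sum(counts.values())
# Add-one (Laplace) smoothing: no probability is ever exactly 0
p_win = (counts["win"] + 1) / (total + vocab_size)   # 1/5 instead of 0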
I could simply drop the log, but then the engine would still fail to classify such long messages correctly, so that's not the point.
There is no multiplication in Naive Bayes if you work in log space (as in the log-sum-exp formulation), only additions, so underflow is unlikely. And if you use smoothing (as you say you do), no probability is ever zero, so you never get undefined behavior from log.
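To make that concrete, here is a tiny sketch (made-up per-word likelihood) showing that a sum of logs stays a perfectly ordinary float even for a 10k-word message, while the raw product underflows:
import math

log_p = math.log(1e-5)                 # log-likelihood of one word, made-up value
message_length = 10_000
log_score = message_length * log_p     # about -115129.25, perfectly representable
raw_score = (1e-5) ** message_length   # underflows to 0.0
print(log_score, raw_score)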
This stats stackexchange answer describes the underlying math. As a reference implementation, I have a snippet of mine lying around for MultinomialNaiveBayes (analogous to sklearn's sklearn.naive_bayes.MultinomialNB, with a similar API):
import numpy as np
import scipy.sparse
class MultinomialNaiveBayes:
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # Calculate priors from data
        self.log_priors_ = np.log(np.bincount(y) / y.shape[0])
        # Get indices where data belongs to each class, creating a slicing mask.
        class_indices = np.array(
            np.ma.make_mask([y == current for current in range(len(self.log_priors_))])
        )
        # Divide dataset based on class indices
        class_datasets = np.array([X[indices] for indices in class_indices])
        # Count how often each word occurs per class and add alpha smoothing.
        # Reshape into [n_classes, n_features]
        classes_metrics = (
            np.array([dataset.sum(axis=0) for dataset in class_datasets]).reshape(
                len(self.log_priors_), -1
            )
            + self.alpha
        )
        # Calculate log likelihoods
        self.log_likelihoods_ = np.log(
            classes_metrics / classes_metrics.sum(axis=1)[:, np.newaxis]
        )
        return self

    def predict(self, X):
        # Return most likely class
        return np.argmax(
            scipy.sparse.csr_matrix.dot(X, self.log_likelihoods_.T) + self.log_priors_,
            axis=1,
        )
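A minimal usage sketch, assuming X is a scipy.sparse CSR matrix of word counts (which is what the csr_matrix.dot call in predict expects) and y holds integer class labels starting at 0; the counts below are made up:
from scipy.sparse import csr_matrix

X_train = csr_matrix(np.array([
    [2, 1, 0, 0],   # word counts of a class-0 message
    [0, 0, 3, 1],   # word counts of a class-1 message
    [1, 2, 0, 0],
    [0, 1, 2, 2],
]))
y_train = np.array([0, 1, 0, 1])

model = MultinomialNaiveBayes(alpha=1.0).fit(X_train, y_train)
print(model.predict(csr_matrix([[0, 0, 4, 1]])))  # picks class 1 for this count vector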
BTW, -0.0 compares equal to 0.0 and is a perfectly sensible value.
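You can check that yourself:
import math

print(-0.0 == 0.0)                # True
print(math.copysign(1.0, -0.0))   # -1.0: only the sign bit differs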