Tags: python, machine-learning, naive-bayes

Avoid getting nonsensical values (e.g. -0.0) when classifying very long texts with Naive Bayes


I've implemented a fairly efficient multinomial Naive Bayes classifier and it works like a charm. That is, until the classifier encounters very long messages (on the order of 10k words): the prediction results are nonsensical (e.g. -0.0) and I get a math domain error from Python's math.log function. The reason I'm using log is that multiplying many very small floats yields even smaller floats, and working in log space avoids the vanishingly small numbers that would break the predictions.
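To make the failure mode concrete, here is a quick sketch (with made-up numbers) of why the product of thousands of per-word likelihoods underflows to 0.0 while the sum of their logs stays finite:

```python
import math

# Hypothetical scenario: 10k words, each with a per-word likelihood of 1e-4
probs = [1e-4] * 10_000

product = 1.0
for p in probs:
    product *= p  # underflows to exactly 0.0 long before the loop ends

# math.log(product) would now raise "math domain error" -- log(0) is undefined.
# Summing logs instead never touches the tiny product:
log_score = sum(math.log(p) for p in probs)

print(product)    # 0.0 -- the nonsensical score
print(log_score)  # a large negative, but perfectly usable, number
```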

Some context

I'm using the bag-of-words approach without any vectorization weighting such as TF-IDF, because I can't figure out how to implement it properly while handling words that occur zero times (a snippet for that would be appreciated too ;) ). I'm using raw frequency counts with Laplace additive smoothing (adding 1 to each count so it is never 0).
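On the TF-IDF side question: one common way to handle zero-occurring words is the smoothed IDF that sklearn's TfidfVectorizer uses with smooth_idf=True, i.e. pretend one extra document containing every term exists, so the document frequency is never zero. A minimal sketch, with a made-up toy corpus and a hypothetical `tfidf` helper:

```python
import math
from collections import Counter

# Toy corpus (illustrative only)
docs = [["spam", "spam", "offer"], ["hello", "friend"], ["spam", "friend"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter()
for doc in docs:
    df.update(set(doc))

def tfidf(doc):
    """Smoothed TF-IDF weights for one document.

    idf = log((1 + N) / (1 + df)) + 1, so a term with df == 0
    (Counter returns 0 for unseen keys) still gets a finite weight.
    """
    tf = Counter(doc)
    return {
        term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }

print(tfidf(["spam", "unseen"]))  # both terms get finite, positive weights
```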

I could just take the log out, but then the engine would still fail to classify such long messages properly, so that's not the point.


Solution

  • There is no multiplication in Naive Bayes if you work in log space (with log-sum-exp for normalization), only additions, so underflow is unlikely. And if you use smoothing (as you say you do), log is never called on zero, so you never get undefined behavior.

    This stats stackexchange answer describes the underlying math. As a reference implementation, here is a snippet of mine for MultinomialNaiveBayes (analogous to sklearn's sklearn.naive_bayes.MultinomialNB, with a similar API):

    import numpy as np
    import scipy.sparse
    
    
    class MultinomialNaiveBayes:
        def __init__(self, alpha: float = 1.0):
            self.alpha = alpha
    
        def fit(self, X, y):
            # Calculate log-priors from the class frequencies in the data
            self.log_priors_ = np.log(np.bincount(y) / y.shape[0])
            n_classes = len(self.log_priors_)
    
            # Sum word counts per class (boolean mask selects that class's rows)
            # and add alpha smoothing. Result has shape [n_classes, n_features].
            classes_metrics = (
                np.array(
                    [
                        np.asarray(X[y == current].sum(axis=0)).ravel()
                        for current in range(n_classes)
                    ]
                )
                + self.alpha
            )
    
            # Log-likelihood of each word given each class
            self.log_likelihoods_ = np.log(
                classes_metrics / classes_metrics.sum(axis=1)[:, np.newaxis]
            )
    
            return self
    
        def predict(self, X):
            # Joint log-probability of each class; return the most likely one
            return np.argmax(
                scipy.sparse.csr_matrix.dot(X, self.log_likelihoods_.T) + self.log_priors_,
                axis=1,
            )
    

    BTW, -0.0 compares equal to 0.0 and is a perfectly sensible value (it is just IEEE 754 negative zero).
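To see the log-space scoring in isolation, here is a toy worked example (made-up counts, no class) of the same computation: the score of each class is a dot product of the count vector with the log-likelihoods plus the log-prior, and scipy's logsumexp turns the scores into proper posteriors without ever leaving log space:

```python
import numpy as np
from scipy.special import logsumexp

# Word counts for 4 training docs over a 3-word vocabulary, 2 classes
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 2]])
y = np.array([0, 0, 1, 1])
alpha = 1.0

log_priors = np.log(np.bincount(y) / y.size)
counts = np.array([X[y == c].sum(axis=0) for c in range(2)]) + alpha
log_likelihoods = np.log(counts / counts.sum(axis=1, keepdims=True))

x_new = np.array([4, 0, 1])                      # looks like class 0
scores = x_new @ log_likelihoods.T + log_priors  # joint log-probabilities
posteriors = np.exp(scores - logsumexp(scores))  # normalized, sums to 1

print(scores.argmax())  # 0
```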