Tags: apache-spark, machine-learning, classification, naivebayes, apache-spark-ml

SPARK ML, Naive Bayes classifier: high probability prediction for one class


I am using Spark ML to optimise a Naive Bayes multi-class classifier.

I have about 300 categories and I am classifying text documents. The training set is balanced enough and there are about 300 training examples for each category.

All looks good and the classifier is working with acceptable precision on unseen documents. But what I am noticing is that, when classifying a new document, the classifier very often assigns a high probability to one of the categories (the prediction probability is almost equal to 1), while the other categories receive very low probabilities (close to zero).

What are the possible reasons for this?

I would like to add that Spark ML exposes something called a "raw prediction". When I look at it, I can see negative numbers of more or less comparable magnitude, so even the category with the high probability has a comparable raw prediction score, but I am having difficulty interpreting these scores.


Solution

  • Let's start with a very informal description of the Naive Bayes classifier. If C is the set of all classes, d is a document, and the x_i are its features, Naive Bayes returns:

        argmax_{c ∈ C} P(c|d)

    Since P(d) is the same for all classes we can simplify this to

        argmax_{c ∈ C} P(d|c) P(c)

    where, by Bayes' theorem,

        P(c|d) = P(d|c) P(c) / P(d)

    Since we assume that the features are conditionally independent (that is why it is naive) we can further simplify this (with a Laplace correction to avoid zero probabilities) to:

        argmax_{c ∈ C} P(c) ∏_{i=1}^{N} P(x_i|c)
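The Laplace-corrected term P(x_i|c) can be sketched in a few lines of plain Python (not Spark code; the counts and vocabulary size below are invented for illustration):

```python
def smoothed_prob(word_count_in_class, total_words_in_class, vocab_size, k=1.0):
    """Laplace (add-k) corrected estimate of P(x_i | c)."""
    return (word_count_in_class + k) / (total_words_in_class + k * vocab_size)

# A word never seen in class c still gets a small non-zero probability,
# so the product over features never collapses to exactly zero:
p_unseen = smoothed_prob(0, 1000, 5000)   # (0 + 1) / (1000 + 5000) = 1/6000
p_seen = smoothed_prob(40, 1000, 5000)    # (40 + 1) / (1000 + 5000) = 41/6000
```

The constant k here corresponds to the smoothing parameter of Spark's NaiveBayes, which defaults to 1.0.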

    The problem with this expression is that in any non-trivial case it is numerically equal to zero. To avoid this we use the following property:

        log(a · b) = log(a) + log(b)

    and replace the initial condition with:

        argmax_{c ∈ C} [ log P(c) + Σ_{i=1}^{N} log P(x_i|c) ]
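The switch to log-space can be illustrated in plain Python (a toy sketch with made-up class priors and likelihoods, not Spark code): both forms pick the same class, but only the log form stays usable when the product of hundreds of small likelihoods underflows to zero:

```python
import math

# Invented toy parameters: two classes, three features.
priors = {"a": 0.5, "b": 0.5}
likelihoods = {"a": [1e-3, 2e-4, 5e-5], "b": [1e-4, 1e-4, 1e-4]}

def product_score(c):
    # P(c) * prod_i P(x_i|c): underflows to 0.0 for long documents
    s = priors[c]
    for p in likelihoods[c]:
        s *= p
    return s

def log_score(c):
    # log P(c) + sum_i log P(x_i|c): numerically stable
    return math.log(priors[c]) + sum(math.log(p) for p in likelihoods[c])

# Both scores rank the classes identically:
best_by_product = max(priors, key=product_score)
best_by_log = max(priors, key=log_score)
```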

    These are the values you get as the raw prediction. Since each element is negative (the logarithm of a value in (0, 1]), the whole expression has a negative value as well. As you discovered yourself, these values are further normalized so that the maximum value is equal to 1, and then divided by the sum of the normalized values.
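This normalization explains the behaviour from the question: raw scores of comparable magnitude still turn into one probability near 1, because exponentiation amplifies even modest differences. A plain-Python sketch (the rawPrediction values below are hypothetical, not taken from a real model):

```python
import math

raw = [-100.0, -105.0, -110.0]  # log-domain scores, only ~5% apart

# Shift so the maximum maps to exp(0) = 1, exponentiate, divide by the sum:
m = max(raw)
unnorm = [math.exp(r - m) for r in raw]
probs = [u / sum(unnorm) for u in unnorm]
# probs[0] comes out around 0.99; the other two are close to zero
```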

    It is important to note that while the values you get are not strictly P(c|d), they preserve all the important properties: the order and the ratios are exactly the same (ignoring possible numerical issues). If one class gets a prediction close to one, it means that, given the evidence, it is a very strong prediction. So it is actually something you want to see.