python scikit-learn naivebayes outliers

Return raw probabilities of sklearn's Gaussian Naive Bayes


I am using scikit-learn's GaussianNB for supervised classification. When I use the "predict_proba" method, the probabilities always sum to 1.

What I would like to get instead is the actual value of the fitted Gaussian distribution for each class, because my dataset contains many outliers. With 3 identified categories, I would like the model to tell me: "There is a 10% chance of being category A, 0.5% of being category B and 4% of being category C". In other words, the sample is most likely an outlier.

Does sklearn return this result as well? Or should I do the math myself from the fitted means and standard deviations?


Solution

  • The solution I finally used is the following:

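    import numpy as np
    from sklearn import naive_bayes

    # X_train, y_train: your labelled training data (the model must be fitted first)
    gaussian_model = naive_bayes.GaussianNB().fit(X_train, y_train)
    jll = gaussian_model._joint_log_likelihood(X)  # private method: log prior + log density per class
    raw_proba = np.exp(jll)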
    

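    raw_proba is not constrained to lie between 0 and 1 (it is the class prior times the fitted Gaussian density, not a normalized probability), but since I only want to rank results, I don't really care about the value itself.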
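
    For anyone who would rather not depend on a private method, here is a minimal sketch of the same quantity computed from the fitted attributes (theta_, class_prior_, and var_, or sigma_ on older releases). The toy data, the helper name joint_log_likelihood, and the variable names are made up for illustration. I believe recent scikit-learn releases also expose a public predict_joint_log_proba method, but check your installed version before relying on it.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (made up for illustration): two well-separated 2-D classes.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                     rng.normal(5.0, 1.0, size=(50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X_train, y_train)


def joint_log_likelihood(model, X):
    """Return log P(c) + sum_j log N(x_j | mean_cj, var_cj), shape (n_samples, n_classes)."""
    # The per-class variances live in `var_` on recent releases, `sigma_` on older ones.
    var = model.var_ if hasattr(model, "var_") else model.sigma_
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - model.theta_[None, :, :]          # (n_samples, n_classes, n_features)
    log_density = -0.5 * (np.log(2.0 * np.pi * var)[None, :, :]
                          + diff ** 2 / var[None, :, :]).sum(axis=2)
    return np.log(model.class_prior_)[None, :] + log_density


X_new = np.array([[0.2, -0.1],      # close to class 0
                  [5.1, 4.8],       # close to class 1
                  [40.0, -40.0]])   # far from both classes: a likely outlier

jll = joint_log_likelihood(model, X_new)

# Best (largest) unnormalized log score per sample; a very low value means
# no class explains the sample well.
best_score = jll.max(axis=1)
outlier_rank = np.argsort(best_score)   # most outlier-like sample first

print(model.predict_proba(X_new).sum(axis=1))  # rows still sum to 1, even for the outlier
print(best_score)
print(outlier_rank)
```

    Ranking on the log scores directly (instead of np.exp(jll)) avoids underflow to 0 for samples that are very far from every class, and the ordering is the same because exp is monotonic.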