I am using Scikit-Learn's GaussianNB for supervised classification. When I use the "predict_proba" method, the probabilities always sum to 1.
What I would like to get instead is the actual value of the fitted Gaussian distribution for each class, because my dataset contains many outliers. If I had 3 identified categories, I would like the model to tell me: "There is a 10% chance of being category A, 0.5% of being category B and 4% of being category C". In other words, the sample is more likely to be an outlier.
Does sklearn return this result as well? Should I do the math myself based on the mean and standard deviation?
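For reference, doing the math by hand could look roughly like the sketch below. It uses NumPy only and assumes, as Gaussian naive Bayes does, that the features are independent within each class. The toy data, the class labels and the helper name gaussian_log_density are made up for illustration. The result is a density, not a probability, so it can exceed 1, and it will differ slightly from what GaussianNB computes internally because sklearn adds a small smoothing term to the variances.

```python
import numpy as np

def gaussian_log_density(X, mean, std):
    """Log of the product of independent 1-D Gaussian densities, one value per row of X."""
    X = np.asarray(X, dtype=float)
    z = (X - mean) / std
    return np.sum(-0.5 * z**2 - np.log(std) - 0.5 * np.log(2.0 * np.pi), axis=1)

# Hypothetical toy data: two features, three classes (A, B, C).
rng = np.random.default_rng(0)
centers = {"A": (0.0, 0.0), "B": (5.0, 5.0), "C": (0.0, 5.0)}
train = {label: rng.normal(loc=c, scale=1.0, size=(200, 2)) for label, c in centers.items()}

# Per-class mean and standard deviation of each feature.
params = {label: (x.mean(axis=0), x.std(axis=0)) for label, x in train.items()}

# One typical point near class A and one point far from every class.
X_new = np.array([[0.1, -0.2], [20.0, -20.0]])
log_dens = {label: gaussian_log_density(X_new, m, s) for label, (m, s) in params.items()}

for label, values in log_dens.items():
    print(label, values)
```

Working in log space avoids underflow for far-away points; np.exp(log_dens[label]) gives the raw density if it is really needed.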
The solution I finally used is the following:
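import numpy as np
from sklearn import naive_bayes
gaussian_model = naive_bayes.GaussianNB()
gaussian_model.fit(X_train, y_train)  # fit first; X_train, y_train stand for your labelled training data
jll = gaussian_model._joint_log_likelihood(X)  # private method: log P(class) + log p(x | class)
raw_proba = np.exp(jll)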
raw_proba is not between 0 and 1 (it is a joint density value, not a probability), but since I only want to rank the results, I don't really care about the value itself.
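To make the link with predict_proba explicit, here is a minimal sketch on made-up toy data (the data, X_new and the variable names are assumptions for illustration). It relies on the same private method _joint_log_likelihood, which may change between sklearn versions. It stays in log space, checks that normalising the joint values gives back predict_proba, and ranks samples by log p(x) so that the least typical ones come last.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: three well-separated classes in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_train = np.vstack([rng.normal(loc=c, scale=1.0, size=(200, 2)) for c in centers])
y_train = np.repeat([0, 1, 2], 200)

model = GaussianNB().fit(X_train, y_train)

# One typical point and one point far from every class.
X_new = np.array([[0.1, -0.2], [20.0, -20.0]])

# Private method, as in the post: log P(c) + log p(x | c) for each class c.
jll = model._joint_log_likelihood(X_new)

# log p(x): log-sum-exp over the classes, computed without leaving log space.
log_marginal = np.logaddexp.reduce(jll, axis=1)

# Normalising the joint values by p(x) reproduces predict_proba.
posterior = np.exp(jll - log_marginal[:, None])

# Rank from most to least typical; the far-away point comes last.
ranking = np.argsort(-log_marginal)
print(ranking)
```

The rows of posterior always sum to 1, which is exactly what hides how unlikely a sample is under every class. Ranking on jll or log_marginal directly gives the same order as ranking on np.exp of them, and avoids underflow to 0 for extreme outliers.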