python, error-handling, scikit-learn, naivebayes

sklearn's predict_proba returns infinite probabilities


I am using scikit-learn's CalibratedClassifierCV with GaussianNB() to run binary classification on some data. I have verified the inputs in .fit(X_train, y_train) and they have matching dimensions and both pass the np.isfinite test.

My problem is when I run .predict_proba(X_test). For some of the samples, the probabilities returned are array([-inf, inf]), and I can't really understand why.

This came to light when I tried running brier_score_loss on the resulting predictions, and it threw a ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I have added some data to this Google Drive link. It's larger than I wanted, but I couldn't get consistent reproduction with smaller datasets. The code for reproduction is below. There is some randomness in the code, so if no infinite values are found, try running it again; in my experiments it finds them on the first try.

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB

# Load the feature matrix and binary labels from the shared file
loaded = np.load('data.npz')
X = loaded['X']
y = loaded['y']

# Train on the first `num` samples, predict on the next num // 4
num = 2 * 10**4
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2)
cal_classifier = CalibratedClassifierCV(GaussianNB(), method='isotonic', cv=sss)

classifier_fit = cal_classifier.fit(X[:num], y[:num])
# Probability of the positive class for each test sample
predicted_probabilities = classifier_fit.predict_proba(X[num:num + num // 4])[:, 1]

# Show the non-finite entries (inf or NaN), if any
print(predicted_probabilities[np.argwhere(~np.isfinite(predicted_probabilities))])

Solution

  • It seems that the isotonic regression (used by CalibratedClassifierCV when method = 'isotonic') is producing the inf values. More precisely, they appear to come from the linear interpolation that the isotonic regression performs when predicting.

    When the interpolation is evaluated on very small inputs (greater than 0 but below a certain threshold), it returns inf.

    In debug mode, self.f_([0, 3.2392382784e-313]) returns [0.10430463576158941, inf], which is strange behaviour. The value 3.2392382784e-313 is a subnormal float64 (smaller than the smallest normal float64, about 2.2e-308), and the implementation of interpolate.interp1d probably doesn't handle such "super-small" values. Hope it helps.
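To illustrate how a value like 3.2392382784e-313 can turn a linear interpolation into inf, here is a minimal sketch using only NumPy. The helpers is_subnormal and naive_linear_interp are hypothetical names written for this illustration; they are not SciPy's or scikit-learn's actual code, and the y values 0.1 and 0.9 are made up. The sketch assumes the interpolation computes a slope as (y_hi - y_lo) / (x_hi - x_lo), which overflows when the x interval is this narrow.

```python
import numpy as np

# Smallest positive *normal* float64, about 2.2e-308.
# Nonzero values with a smaller magnitude are subnormal.
TINY = np.finfo(np.float64).tiny


def is_subnormal(x):
    """Return True where x is nonzero but smaller in magnitude than TINY."""
    x = np.asarray(x, dtype=np.float64)
    return (x != 0) & (np.abs(x) < TINY)


def naive_linear_interp(x_new, x_lo, x_hi, y_lo, y_hi):
    """Hypothetical slope-based linear interpolation between two points.

    This is an assumption about the mechanism, not the library's code.
    """
    x_new, x_lo, x_hi, y_lo, y_hi = (
        np.float64(v) for v in (x_new, x_lo, x_hi, y_lo, y_hi)
    )
    # Silence the overflow warning so the inf is visible in the result.
    with np.errstate(over="ignore", invalid="ignore"):
        slope = (y_hi - y_lo) / (x_hi - x_lo)  # overflows to inf
        return y_lo + slope * (x_new - x_lo)


x_small = np.float64(3.2392382784e-313)  # value from the debug session above
result = naive_linear_interp(x_small, 0.0, x_small, 0.1, 0.9)

print(is_subnormal(x_small))
print(result)
```

If this is indeed the mechanism, checking the base classifier's outputs with is_subnormal before calibration may help confirm which test samples trigger the problem.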