Search code examples
pythonprecisionprobability

calculating probabilities using cdf without numerical underflow/overflow (in Python)


Consider the following task: for an arbitrary value x and a positive number s, compute the probability that a normally distributed random variable falls in the interval of length s centered at x.

In principle this is easily done:

def normal_inverval_prob(y, s, mean, sd):
    return norm.cdf(x=y+s/2.0, loc=mean, scale=sd) - norm.cdf(x=y-s/2.0, loc=mean, scale=sd)

normal_inverval_prob(-3, .2, 1, 1)#2.7438837105055897e-05
normal_inverval_prob(-3, .2, 1, .1)# 0.0

My problem is that last line: for some values I get a probability of zero though the actual probability is some small number greater than zero. This causes divide-by-zero issues for me later in my code.

Turns out I can work with log probabilities, so I rework the function to give me log probabilities using only the log cdf:

def normal_inverval_logprob(y, s, mean, sd):
    p1 = norm.logcdf(x=y+s/2.0, loc=mean, scale=sd)
    p0 = norm.logcdf(x=y-s/2.0, loc=mean, scale=sd)
    return p1 + np.log1p(-np.exp(p0 - p1))

np.exp(normal_inverval_logprob(-3, .2, 1, 1))#2.7438837105055897e-05
normal_inverval_logprob(-3, .2, 1, .1)#-765.0831565643776

For other values, this log probability function runs into problems:

normal_inverval_logprob(3, .2, 1, .1)
/home/keith/.local/lib/python3.6/site-packages/ipykernel_launcher.py:4: RuntimeWarning: divide by zero encountered in log1p
  after removing the cwd from sys.path.
-inf

The problem, as you might expect is that the exp of the difference in log cdfs at this point evaluates to 1 (a different kind of numerical underflow problem) despite the fact that the log cdfs are not equal:

np.exp(norm.logcdf(2.9, 1, .1) - norm.logcdf(3.1, 1, .1))#1.0
norm.logcdf(3.1, 1, .1) > norm.logcdf(2.9, 1, .1)#True
np.allclose(norm.logcdf(3.1, 1, .1), norm.logcdf(2.9, 1, .1))#True

I'm not sure how to work around this (or if there is some totally different way to accomplish my goal).


Solution

  • One trivial approach is to use expm1 instead of log1p:

    return p1 + np.log(-np.expm1(p0 - p1))
    

    If even that fails, you could approximate with a Riemann sum (here, of just one term):

    def normal_inverval_prob(y, s, mean, sd):
      return norm.pdf(x=y, loc=mean, scale=sd) * s
    

    This will underestimate the tails; you could average the values at the endpoints of your interval to get an upper bound there. Of course, with the exp(-x2) eventually even that will underflow: the PDF is already too small for float64 by z=±39.