python matlab numpy fortran numerical-methods

Discontinuity in results when using scipy.integrate.quad

I've discovered a strange behavior when using scipy.integrate.quad. This behavior also shows up in Octave's quad function, which leads me to believe that it may have something to do with QUADPACK itself. Interestingly enough, using the exact same Octave code, this behavior does not show up in MATLAB.

On to the question. I'm numerically integrating a lognormal distribution over various bounds. For F is cdf of lognormal, a is lower bound and b is upper bound, I find that under some conditions,

integral(F, a, b) = 0 when b is a "very large number," while

integral(F, a, b) = the correct limit when b is np.inf. (or just Inf for Octave.)

Here's some example code to show it in action:

from __future__ import division
import numpy as np
import scipy.stats as stats
from scipy.integrate import quad

# Set up the probability space:
sigma = 0.1
mu = -0.5*(sigma**2)  # To get E[X] = 1
N = 7
z = stats.lognormal(sigma, 0, np.exp(mu))


# Set up F for integration:
F = lambda x: x*z.pdf(x)

# An example that appears to work correctly:
a, b = 1.0, 10
quad(F, a, b)
# (0.5199388..., 5.0097567e-11)

# But if we push it higher, we get a value which drops to 0:
quad(F, 1.0, 1000)
# (1.54400e-11, 3.0699e-11)

# HOWEVER, if we shove np.inf in there, we get correct answer again: 
quad(F, 1.0, np.inf)
# (0.5199388..., 3.00668e-09)


# If we play around we can see where it "breaks:"
quad(F, 1.0, 500)   # Ok
quad(F, 1.0, 831)   # Ok
quad(F, 1.0, 832)   # Here we suddenly hit close to zero. 
quad(F, 1.0, np.inf)   # Ok again

What is going on here? Why does quad(F, 1.0, 500) evaluate to approximately the correct thing, but quad(F, 1.0, b) goes to zero for all values 832 <= b < np.inf?

Solution

While I'm not exactly familiar with QUADPACK, adaptive integration generally works by increasing resolution until the answer no longer improves. Your function is so close to 0 for most of the interval (with F(10)==9.356e-116) that the improvement is negligible for the initial grid points that quad chooses, and it decides that the integral must be close to 0. Basically, if your data hides in a very narrow subinterval of the range of integration, quad eventually won't be able to find it.

For integration from 0 to inf, the interval obviously cannot be subdivided into a finite number of intervals, so quad will need some preprocessing before computing the integral. For example, a change of variables like y=1/(1+x) would map the interval 0..inf to 0..1. Subdividing that interval will sample more points near zero from the original function, enabling quad to find your data.