statsmodels: IntegrationWarning: The maximum number of subdivisions (50) has been achieved

I was trying to plot a CDF with seaborn when I got this warning:

../venv/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:178: IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
  If increasing the limit yields no improvement it is advised to analyze 
  the integrand in order to determine the difficulties.  If the position of a 
  local difficulty can be determined (singularity, discontinuity) one will 
  probably gain from splitting up the interval and calling the integrator 
  on the subranges.  Perhaps a special-purpose integrator should be used.
  args=endog)[0] for i in range(1, gridsize)]

A few minutes after pressing the return key, this followed:

../venv/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:178: IntegrationWarning: The integral is probably divergent, or slowly convergent.
  args=endog)[0] for i in range(1, gridsize)]

Code:

plt.figure()
plt.title('my distribution')
plt.ylabel('CDF')
plt.xlabel('x-labelled')
sns.kdeplot(data,cumulative=True)
plt.show()

If it could be of help:

print(len(data))

4360700

Sample data:

print(data[:10])

[ 0.00362846  0.00123409  0.00013711 -0.00029235  0.01515175  0.02780404
  0.03610236  0.03410224  0.03887933  0.0307084 ]

I have no idea what the subdivisions are. Is there a way to increase their number?

Solution

A KDE plot is created by summing one Gaussian bell curve for every data point. Summing 4 million curves creates memory and performance problems, which can cause some functions to fail, and the resulting error message can be very cryptic.
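To make the cost concrete, here is a minimal NumPy-only sketch of a KDE built by summing one Gaussian per sample. It is not what seaborn or statsmodels do internally; the function name `gaussian_kde_sum`, the synthetic data and the rule-of-thumb bandwidth are all assumptions chosen for illustration. As for the question itself: the warning class and wording appear to come from `scipy.integrate.quad`, whose `limit` argument (default 50) caps the number of subintervals it may use, and `sns.kdeplot` does not seem to expose that setting.

```python
import numpy as np

def gaussian_kde_sum(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE on `grid` by averaging one bell curve per sample."""
    # z has shape (len(grid), len(samples)), so memory grows with both sizes
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=2000)  # small synthetic stand-in for the real data
grid = np.linspace(-6, 6, 601)

# normal-reference rule of thumb, used here only as an illustrative bandwidth
bandwidth = 1.06 * samples.std() * samples.size ** (-1 / 5)

density = gaussian_kde_sum(samples, grid, bandwidth)
dx = grid[1] - grid[0]
cdf = np.cumsum(density) * dx  # cumulative KDE via a simple Riemann sum
area = cdf[-1]                 # should be close to 1
```

With 2000 samples and 601 grid points the intermediate array already holds about 1.2 million values; with 4 million samples it would be roughly 2000 times larger, which is why subsampling helps.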

The easiest workaround is to subsample the data: for a more or less smooth distribution, the KDE (and the cumulative KDE, or CDF) will look very similar whether the data is subsampled or not. Taking every 100th entry is easy using slicing: data[::100].
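A quick way to check that claim is to compare a few quantiles of the full data with those of a subsample. The sketch below uses synthetic data as a stand-in, and also tries a random subsample via `rng.choice`, which may be preferable if the data has a periodic pattern that a fixed stride could line up with. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(15.0, 3.0, size=1_000_000)  # synthetic stand-in for the real data

strided = data[::100]  # every 100th entry
random_subset = rng.choice(data, size=data.size // 100, replace=False)

probs = [0.05, 0.25, 0.5, 0.75, 0.95]
q_full = np.quantile(data, probs)
q_strided = np.quantile(strided, probs)
q_random = np.quantile(random_subset, probs)

# largest absolute difference between subsample and full-data quantiles
max_dev_strided = np.max(np.abs(q_full - q_strided))
max_dev_random = np.max(np.abs(q_full - q_random))
print(max_dev_strided, max_dev_random)
```

If both deviations are small relative to the spread of the data, the subsampled curve should be visually indistinguishable from the full one.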

Alternatively, with that much data, the "real" CDF can be drawn by plotting the sorted data against N evenly spaced numbers from 0 to 1 (where N is the number of data points).

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

N = 1000000
data = np.random.normal(np.repeat(np.random.uniform(10, 20, 10), N // 10), 1)
sns.kdeplot(data[::100], cumulative=True, color='g', label='cumulative kde')
q = np.linspace(0, 1, data.size)
data.sort()
plt.plot(data, q, ':r', lw=2, label='cdf from sorted data')
plt.legend()
plt.show()
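The same sorted array can also be used to read values off the empirical CDF without plotting. A minimal sketch, assuming synthetic data and a hypothetical helper named `ecdf_at`:

```python
import numpy as np

def ecdf_at(sorted_data, x):
    """Return the fraction of samples <= x for each value in x."""
    # searchsorted with side='right' counts how many samples are <= x
    return np.searchsorted(sorted_data, x, side='right') / sorted_data.size

rng = np.random.default_rng(0)
data = np.sort(rng.normal(0.0, 1.0, size=100_000))  # synthetic stand-in, sorted once

values = ecdf_at(data, np.array([-1.0, 0.0, 1.0]))
print(values)
```

Because the lookup is a binary search, it stays cheap even for millions of points once the data is sorted.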

Note that in a similar, though slightly more involved, way you can draw a "more honest" KDE from the differences of a large enough array of sorted data. np.interp interpolates the quantiles onto a regularly spaced x-axis. As the raw differences are rather jagged, some smoothing is needed.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm

N = 1000000
data = np.random.normal(np.repeat(np.random.uniform(10, 20, 10), N // 10), 1)
sns.kdeplot(data[::100], cumulative=False, color='g', label='kde')
p = np.linspace(0, 1, data.size)
data.sort()

x = np.linspace(data.min(), data.max(), 1000)
y = np.interp(x, data, p)

# dividing the CDF steps by the grid spacing turns them into a density estimate
density = np.diff(y) / np.diff(x)
# use a lowess filter to smooth the curve
lowess = sm.nonparametric.lowess(density, (x[:-1] + x[1:]) / 2, frac=0.05)
plt.plot(lowess[:, 0], lowess[:, 1], '-r', label='smoothed diff of sorted data')

# plt.plot((x[:-1]+x[1:])/2,
#         np.convolve(np.diff(y), np.ones(20)/20, mode='same')*1000/(data.max() - data.min()),
#         label='test np.diff')

plt.legend()
plt.show()
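If statsmodels is not available, the moving-average idea hinted at in the commented-out lines can stand in for the lowess filter. This is a NumPy-only sketch on synthetic data; the window length of 20 is an arbitrary choice, and a moving average smooths less gracefully near the edges than lowess does.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.sort(rng.normal(15.0, 3.0, size=200_000))  # synthetic stand-in, sorted
p = np.linspace(0, 1, data.size)

# empirical CDF interpolated onto a regular grid
x = np.linspace(data.min(), data.max(), 1000)
y = np.interp(x, data, p)

# raw density: CDF increments divided by the grid spacing
raw_density = np.diff(y) / np.diff(x)
x_mid = (x[:-1] + x[1:]) / 2

# moving average over `window` neighbouring bins
window = 20
smooth_density = np.convolve(raw_density, np.ones(window) / window, mode='same')

# sanity check: a density should integrate to roughly 1
area = np.sum(smooth_density * np.diff(x))
print(area)
```

The result can be drawn with plt.plot(x_mid, smooth_density) alongside the KDE for comparison.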