Tags: matplotlib, seaborn, statsmodels

statsmodels: IntegrationWarning: The maximum number of subdivisions (50) has been achieved


While trying to plot a CDF with seaborn, I encountered this warning:

../venv/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:178: IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
  If increasing the limit yields no improvement it is advised to analyze 
  the integrand in order to determine the difficulties.  If the position of a 
  local difficulty can be determined (singularity, discontinuity) one will 
  probably gain from splitting up the interval and calling the integrator 
  on the subranges.  Perhaps a special-purpose integrator should be used.
  args=endog)[0] for i in range(1, gridsize)]

A few minutes after pressing the return key, another warning appeared:

../venv/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:178: IntegrationWarning: The integral is probably divergent, or slowly convergent.
  args=endog)[0] for i in range(1, gridsize)]

Code:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure()
plt.title('my distribution')
plt.ylabel('CDF')
plt.xlabel('x-labelled')
sns.kdeplot(data, cumulative=True)
plt.show()

If it could be of help:

print(len(data))

4360700

Sample data:

print(data[:10])

[ 0.00362846  0.00123409  0.00013711 -0.00029235  0.01515175  0.02780404
  0.03610236  0.03410224  0.03887933  0.0307084 ]

I have no idea what the subdivisions are. Is there a way to increase the limit?


Solution

  • A kde plot is created by summing one gaussian bell shape for every data point. Summing 4 million curves will create memory and performance problems, which may cause some functions to fail, and the resulting error messages can be very cryptic.
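To make the cost concrete, here is a minimal, naive KDE written out as the sum of one gaussian per data point (a sketch for illustration, not how statsmodels implements it internally):

```python
import numpy as np

def naive_kde(data, grid, bandwidth):
    """Evaluate a gaussian KDE the naive way: one bell curve per data point.

    Cost is O(len(data) * len(grid)), which is why millions of points
    become slow and memory-hungry.
    """
    # pairwise scaled distances: a len(grid) x len(data) matrix
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # average the kernels and rescale by the bandwidth
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)
grid = np.linspace(-4, 4, 200)
density = naive_kde(data, grid, bandwidth=0.3)
# a proper density estimate should integrate to roughly 1
print(np.trapz(density, grid))
```

With 1000 points this is instant; with 4 million it is 4000 times the work per grid point.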

    The easiest way to work around the problem is to subsample the data: for a more or less smooth distribution, the kde (and the cumulative kde, or cdf) will look very similar whether or not the data is subsampled. Taking every 100th entry is easy with slicing: data[::100].
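A quick sanity check, using synthetic normal data as a stand-in for the real 4.3M values, that subsampling barely moves the summary statistics:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(10, 2, 4_000_000)   # synthetic stand-in for the real data

subsample = data[::100]               # keep every 100th entry

print(len(subsample))                 # 40000 points instead of 4 million
# mean and standard deviation barely change
print(abs(data.mean() - subsample.mean()))
print(abs(data.std() - subsample.std()))
```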

    Alternatively, with that much data, the "real" cdf can be drawn by plotting the sorted data against N evenly spaced numbers from 0 to 1 (where N is the number of data points).

    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    
    N = 1000000
    data = np.random.normal(np.repeat(np.random.uniform(10, 20, 10), N // 10), 1)
    sns.kdeplot(data[::100], cumulative=True, color='g', label='cumulative kde')
    q = np.linspace(0, 1, data.size)
    data.sort()
    plt.plot(data, q, ':r', lw=2, label='cdf from sorted data')
    plt.legend()
    plt.show()
    

    example plot
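The sorted-data-versus-linspace trick is just the empirical CDF. A small non-plotting sketch that checks it against the fraction-below definition via np.searchsorted:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0, 1, 100_000)
data.sort()
q = np.linspace(0, 1, data.size)   # the y-values plotted against the sorted data

# the empirical CDF at x is the fraction of data points <= x
def ecdf(x):
    return np.searchsorted(data, x, side='right') / data.size

# at 0, the median of a standard normal, the CDF should be near 0.5
print(ecdf(0.0))
```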

    Note that in a similar, though slightly more involved, way you can draw a "more honest" kde from the differences of a large enough array of sorted data. np.interp interpolates the quantiles onto a regularly spaced x-axis. As the raw differences are rather jagged, some smoothing is needed.

    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    import statsmodels.api as sm
    
    N = 1000000
    data = np.random.normal(np.repeat(np.random.uniform(10, 20, 10), N // 10), 1)
    sns.kdeplot(data[::100], cumulative=False, color='g', label='kde')
    p = np.linspace(0, 1, data.size)
    data.sort()
    
    x = np.linspace(data.min(), data.max(), 1000)
    y = np.interp(x, data, p)
    
    # use a lowess filter to smooth the curve
    lowess = sm.nonparametric.lowess(np.diff(y) * 1000 / (data.max() - data.min()),
                                     (x[:-1] + x[1:]) / 2, frac=0.05)
    plt.plot(lowess[:, 0], lowess[:, 1], '-r', label='smoothed diff of sorted data')
    
    # plt.plot((x[:-1]+x[1:])/2,
    #         np.convolve(np.diff(y), np.ones(20)/20, mode='same')*1000/(data.max() - data.min()),
    #         label='test np.diff')
    
    plt.legend()
    plt.show()
    

    example of kde
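If pulling in statsmodels only for lowess feels heavy, a gaussian filter from scipy (assuming scipy is available, which it is whenever seaborn is installed) smooths the jagged differences in much the same way:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(2)
data = rng.normal(0, 1, 200_000)
data.sort()
p = np.linspace(0, 1, data.size)

x = np.linspace(data.min(), data.max(), 1000)
y = np.interp(x, data, p)   # interpolate the quantiles onto a regular grid

# raw density estimate: numerical derivative of the interpolated CDF
dx = x[1] - x[0]
density = np.diff(y) / dx

# smooth the jagged derivative with a gaussian filter
smoothed = gaussian_filter1d(density, sigma=10)
```

The sigma value plays a role similar to the lowess frac parameter: larger values give a smoother but flatter curve.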