Search code examples
pythonnormal-distributionastropycdf

How to use Astropy Kuiper test


I want to test how close the distribution of a data set is from a Gaussian with mean=0 and variance=1.

The kuiper test from astropy.stats has a cdf parameter, from the documentation: "A callable to evaluate the CDF of the distribution being tested against. Will be called with a vector of all values at once. The default is a uniform distribution", but I don't know how to use this to test against a normal distribution. What if I want e.g. a normal distribution with mean 0.2 and variance 2?

So I used kuiper_two, also from astropy, and created a random normal distribution. See example below.

The problem I see with this is that it depends on the number of data points I generate to compare against. If I'd have used 100 instead of 10000 data points the probability (fpp) would have raised to 43%.

I guess the question is, how do I do this properly? Also, how do I interpret the D number?

# create data and its cdf
np.random.seed(0)
data = np.random.normal(loc=0.02, scale=1.15, size=50)
data_sort = np.sort(data)
data_cdf = [x/len(data) for x in range(0, len(data))]

# create the normal data with mean 0 and variance 1
xx = np.random.normal(loc=0, scale=1, size=10000)
xx_sort = np.sort(xx)
xx_cdf = [x/len(xx) for x in range(0, len(xx))]
# compute the pdf for a plot
x = np.linspace(-4, 4, 50)
x_pdf = stats.norm.pdf(x, 0, 1)

# we can see it all in a plot
fig, ax = plt.subplots(figsize=(8, 6))
plt.hist(xx, bins=20, density=True, stacked=True, histtype='stepfilled', alpha=0.6)
plt.hist(data, density=True, stacked=True, histtype='step', lw=3)
plt.plot(x, x_pdf, lw=3, label='G($\mu=0$, $\sigma^2=1$)')
ax2 = ax.twinx()
ax2.plot(xx_sort, xx_cdf, marker='o', ms=8, mec='green', mfc='green', ls='None')
ax2.plot(data_sort, data_cdf, marker='^', ms=8, mec='orange', mfc='orange', ls='None')

# Kuiper test
D, fpp = kuiper_two(data_sort, xx_sort)
print('#   D number =', round(D, 5))
print('#   fpp =', round(fpp, 5))
# Which resulted in:
#   D number = 0.211
#   fpp = 0.14802

enter image description here


Solution

  • astropy.stats.kuiper expects as first argument a sample from the distribution you want to test, and as second argument the CDF of the distribution you want to test against.

    This variable is a Callable and expects itself (a) sample(s) for which to return the value(s) under the cumulative distribution function. You can use scipy.stat's CDFs for that. By using functools.partial we can set any parameters.

    from scipy import stats
    from scipy.stats import norm
    from astropy.stats import kuiper
    
    from functools import partial
    from random import shuffle
    
    np.random.seed(0)
    data = np.random.normal(loc=0.02, scale=1.15, size=50)
    
    print(kuiper(data, partial(norm.cdf, loc=0.2, scale=2.0)))
    
    # Output: (0.2252118027033838, 0.08776036566607946)
    
    # The data does not have to be sorted, in case you wondered:
    shuffle(data)
    
    print(kuiper(data, partial(norm.cdf, loc=0.2, scale=2.0)))
    
    # Output: (0.2252118027033838, 0.08776036566607946)
    

    In this diagram from the Wikipedia article about this test, you get an idea what the Kuiper statistic V measures:

    [Lucas(CA), CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons]3 The

    If you share the parameters of the comparison distribution, the distance is lower and the estimated probability that the respective underlying CDFs are identical rises:

    print(kuiper(data, partial(norm.cdf, loc=0.02, scale=1.15)))
    
    # Output: (0.14926352419821276, 0.68365004302431)
    

    The function astropy.stats.kuiper_two in contrast, expects two samples of empirical data to compare with one another. So if you want to compare to a distribution with a tractable CDF, it is preferable to use the CDF directly (with kuiper) instead of sampling from the comparison distribution (and using kuiper_two).

    And one nit-pick. Apart from using CDF in variable names that are not a CDF, this formulation is much more readable than the above:

    data_cdf = np.linspace(0.0, 1.0, len(data), endpoint=False)
    xx_cdf = np.linspace(0.0, 1.0, len(xx), endpoint=False)