Search code examples
pythontestingscipynormal-distributionkolmogorov-smirnov

p_value is 0 when I use scipy.stats.kstest() for large dataset


I have a unique series with there frequencies and want to know if they are from normal distribution so I did a Kolmogorov–Smirnov test using scipy.stats.kstest. Since, to my knowledge, the function takes only a list so I transform the frequencies to a list before I put it into the function. However, the result is weird since the pvalue=0.0

The histogram of the original data and my code are in the followings: Histogram of my dataset

[In]: frequencies = mp[['c','v']]

[In]: print frequencies
         c      v
31  3475.8   18.0
30  3475.6   12.0
29  3475.4   13.0
28  3475.2    8.0
20  3475.0   49.0
14  3474.8   69.0
13  3474.6   79.0
12  3474.4   78.0
11  3474.2   78.0
7   3474.0  151.0
6   3473.8  157.0
5   3473.6  129.0
2   3473.4  149.0
1   3473.2  162.0
0   3473.0  179.0
3   3472.8  145.0
4   3472.6  139.0
8   3472.4   95.0
9   3472.2  103.0
10  3472.0  125.0
15  3471.8   56.0
16  3471.6   75.0
17  3471.4   70.0
18  3471.2   70.0
19  3471.0   57.0
21  3470.8   36.0
22  3470.6   22.0
23  3470.4   20.0
24  3470.2   12.0
25  3470.0   23.0
26  3469.8   13.0
27  3469.6   17.0
32  3469.4    6.0

[In]: testData = map(lambda x: np.repeat(x[0], int(x[1])), frequencies.values)

[In]: testData = list(itertools.chain.from_iterable(testData))

[In]: print len(testData)
2415

[In]: print np.unique(testData)
[ 3469.4  3469.6  3469.8  3470.   3470.2  3470.4  3470.6  3470.8  3471.
  3471.2  3471.4  3471.6  3471.8  3472.   3472.2  3472.4  3472.6  3472.8
  3473.   3473.2  3473.4  3473.6  3473.8  3474.   3474.2  3474.4  3474.6
  3474.8  3475.   3475.2  3475.4  3475.6  3475.8]

[In]: scs.kstest(testData, 'norm')
KstestResult(statistic=1.0, pvalue=0.0)

Thanks everyone at first.


Solution

  • Using 'norm' for your input will check if the distribution of your data is the same as scipy.stats.norm.cdf with default parameters: loc=0, scale=1.

    Instead, you will need to fit a normal distribution to your data and then check if the data and the distribution are the same using the Kolmogorov–Smirnov test.

    import numpy as np
    from scipy.stats import norm, kstest
    import matplotlib.pyplot as plt
    
    freqs = [[3475.8, 18.0], [3475.6, 12.0], [3475.4, 13.0], [3475.2, 8.0], [3475.0, 49.0],
        [3474.8, 69.0], [3474.6, 79.0], [3474.4, 78.0], [3474.2, 78.0], [3474.0, 151.0],
        [3473.8, 157.0], [3473.6, 129.0], [3473.4, 149.0], [3473.2, 162.0], [3473.0, 179.0],
        [3472.8, 145.0], [3472.6, 139.0], [3472.4, 95.0], [3472.2, 103.0], [3472.0, 125.0],
        [3471.8, 56.0], [3471.6, 75.0], [3471.4, 70.0], [3471.2, 70.0], [3471.0, 57.0],
        [3470.8, 36.0], [3470.6, 22.0], [3470.4, 20.0], [3470.2, 12.0], [3470.0, 23.0],
        [3469.8, 13.0], [3469.6, 17.0], [3469.4, 6.0]]
    
    data = np.hstack([np.repeat(x,int(f)) for x,f in freqs])
    loc, scale = norm.fit(data)
    # create a normal distribution with loc and scale
    n = norm(loc=loc, scale=scale)
    

    Plot the fit of the norm to the data:

    plt.hist(data, bins=np.arange(data.min(), data.max()+0.2, 0.2), rwidth=0.5)
    x = np.arange(data.min(), data.max()+0.2, 0.2)
    plt.plot(x, 350*n.pdf(x))
    plt.show()
    

    enter image description here

    This not a terribly good fit, most due to the long tail on the left. However, you can now run a proper Kolmogorov–Smirnov test using the cdf of the fitted normal distribution

    kstest(data, n.cdf)
    # returns:
    KstestResult(statistic=0.071276854859734784, pvalue=4.0967451653273201e-11)
    

    So we are still rejecting the null hypothesis of the distribution that produced the data being the same as the fitted distribution.