I have a series of unique values with their frequencies and want to know whether they come from a normal distribution, so I ran a Kolmogorov–Smirnov test using scipy.stats.kstest. Since, to my knowledge, the function only takes a list, I expanded the frequencies into a list before passing it to the function. However, the result looks wrong: pvalue=0.0.
The histogram of the original data and my code are shown below. (Figure: histogram of my dataset.)
[In]: frequencies = mp[['c','v']]
[In]: print frequencies
c v
31 3475.8 18.0
30 3475.6 12.0
29 3475.4 13.0
28 3475.2 8.0
20 3475.0 49.0
14 3474.8 69.0
13 3474.6 79.0
12 3474.4 78.0
11 3474.2 78.0
7 3474.0 151.0
6 3473.8 157.0
5 3473.6 129.0
2 3473.4 149.0
1 3473.2 162.0
0 3473.0 179.0
3 3472.8 145.0
4 3472.6 139.0
8 3472.4 95.0
9 3472.2 103.0
10 3472.0 125.0
15 3471.8 56.0
16 3471.6 75.0
17 3471.4 70.0
18 3471.2 70.0
19 3471.0 57.0
21 3470.8 36.0
22 3470.6 22.0
23 3470.4 20.0
24 3470.2 12.0
25 3470.0 23.0
26 3469.8 13.0
27 3469.6 17.0
32 3469.4 6.0
[In]: testData = map(lambda x: np.repeat(x[0], int(x[1])), frequencies.values)
[In]: testData = list(itertools.chain.from_iterable(testData))
[In]: print len(testData)
2415
[In]: print np.unique(testData)
[ 3469.4 3469.6 3469.8 3470. 3470.2 3470.4 3470.6 3470.8 3471.
3471.2 3471.4 3471.6 3471.8 3472. 3472.2 3472.4 3472.6 3472.8
3473. 3473.2 3473.4 3473.6 3473.8 3474. 3474.2 3474.4 3474.6
3474.8 3475. 3475.2 3475.4 3475.6 3475.8]
[In]: scs.kstest(testData, 'norm')
KstestResult(statistic=1.0, pvalue=0.0)
Thanks in advance, everyone.
Using 'norm' for your input will check whether the distribution of your data is the same as scipy.stats.norm.cdf with its default parameters: loc=0, scale=1.
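To see why the default parameters give such an extreme result, here is a minimal sketch. It uses a synthetic sample (an assumption, not the data from the question) centred near 3473: every value lies so far above 0 that the standard normal CDF is numerically 1 there, so the gap between the empirical and theoretical CDFs is as large as it can be.

```python
import numpy as np
from scipy.stats import norm, kstest

# Hypothetical sample, roughly on the same scale as the data in the question.
rng = np.random.default_rng(0)
sample = rng.normal(loc=3473.0, scale=1.4, size=500)

# The standard normal CDF saturates at 1 long before values this large.
cdf_at_min = norm.cdf(sample.min())
print(cdf_at_min)

# Comparing against N(0, 1) therefore yields the maximum possible distance.
result = kstest(sample, 'norm')
print(result.statistic, result.pvalue)
```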
Instead, you will need to fit a normal distribution to your data and then check if the data and the distribution are the same using the Kolmogorov–Smirnov test.
import numpy as np
from scipy.stats import norm, kstest
import matplotlib.pyplot as plt
freqs = [[3475.8, 18.0], [3475.6, 12.0], [3475.4, 13.0], [3475.2, 8.0], [3475.0, 49.0],
[3474.8, 69.0], [3474.6, 79.0], [3474.4, 78.0], [3474.2, 78.0], [3474.0, 151.0],
[3473.8, 157.0], [3473.6, 129.0], [3473.4, 149.0], [3473.2, 162.0], [3473.0, 179.0],
[3472.8, 145.0], [3472.6, 139.0], [3472.4, 95.0], [3472.2, 103.0], [3472.0, 125.0],
[3471.8, 56.0], [3471.6, 75.0], [3471.4, 70.0], [3471.2, 70.0], [3471.0, 57.0],
[3470.8, 36.0], [3470.6, 22.0], [3470.4, 20.0], [3470.2, 12.0], [3470.0, 23.0],
[3469.8, 13.0], [3469.6, 17.0], [3469.4, 6.0]]
data = np.hstack([np.repeat(x,int(f)) for x,f in freqs])
loc, scale = norm.fit(data)
# create a normal distribution with loc and scale
n = norm(loc=loc, scale=scale)
Plot the fitted normal distribution against the data:
plt.hist(data, bins=np.arange(data.min(), data.max()+0.2, 0.2), rwidth=0.5)
x = np.arange(data.min(), data.max()+0.2, 0.2)
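# 350 is an ad hoc scale factor chosen by eye; len(data) * 0.2 would scale the density to expected counts per bin
plt.plot(x, 350*n.pdf(x))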
plt.show()
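If you prefer numbers to a plot, the same comparison can be sketched by scaling the fitted density to expected counts per bin, that is, multiplying the pdf by the sample size times the bin width. The names `values`, `counts`, `bin_width` and `expected` are introduced here for illustration only; the counts are the frequencies from the question, listed in ascending order of value.

```python
import numpy as np
from scipy.stats import norm

# Values 3469.4 ... 3475.8 in steps of 0.2, with the frequencies from the question.
values = np.round(np.linspace(3469.4, 3475.8, 33), 1)
counts = np.array([6, 17, 13, 23, 12, 20, 22, 36, 57, 70, 70,
                   75, 56, 125, 103, 95, 139, 145, 179, 162, 149, 129,
                   157, 151, 78, 78, 79, 69, 49, 8, 13, 12, 18])

# Expand the frequency table into individual observations and fit a normal.
data = np.repeat(values, counts)
loc, scale = norm.fit(data)

# Expected count per bin = N * bin width * fitted density at the bin value.
bin_width = 0.2
expected = len(data) * bin_width * norm.pdf(values, loc=loc, scale=scale)

for v, obs, exp in zip(values, counts, expected):
    print(f"{v:.1f}  observed={obs:4d}  expected={exp:7.1f}")
```

Large gaps between the observed and expected columns show where the normal curve misses the histogram.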
The fit is not terribly good, mostly due to the long tail on the left. However, you can now run a proper Kolmogorov–Smirnov test using the cdf of the fitted normal distribution:
kstest(data, n.cdf)
# returns:
KstestResult(statistic=0.071276854859734784, pvalue=4.0967451653273201e-11)
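So we still reject the null hypothesis that the data were produced by the fitted normal distribution. Note that because loc and scale were estimated from the same data, the reported p-value is only approximate (the standard test assumes a fully specified distribution), but with a p-value this small the conclusion is unlikely to change.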
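As an aside, kstest also accepts the distribution name together with the fitted parameters through its args argument, which should be equivalent to passing the cdf of a frozen distribution. A minimal sketch, using a synthetic sample (an assumption, not the data from the question):

```python
import numpy as np
from scipy.stats import norm, kstest

# Hypothetical sample standing in for the real data.
rng = np.random.default_rng(42)
sample = rng.normal(loc=3473.0, scale=1.4, size=1000)

# Fit loc and scale, then test against the fitted normal in two equivalent ways.
loc, scale = norm.fit(sample)
res_frozen = kstest(sample, norm(loc=loc, scale=scale).cdf)
res_args = kstest(sample, 'norm', args=(loc, scale))

print(res_frozen.statistic, res_args.statistic)
```

Either form works; the args form avoids creating the frozen distribution when you only need the test.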