I am learning statistics, and I want to check whether a data set comes from a normal distribution.
I found that the KS test can do this. My code is listed below:
In [1]: from scipy import stats
In [2]: from read_cj import read
In [3]: df = read()
[read] cost 10.066437721252441
In [4]: stats.kstest(df['XH.self_rank(30)'],'norm')
Out[4]: KstestResult(statistic=0.3203690716401366, pvalue=0.0)
This result seems to mean that my column XH.self_rank(30) is normally distributed.
But the histogram looks like this:
I don't think it comes from a normal distribution.
I also tried a few more cases:
In [9]: stats.kstest([1,2,3,4], 'norm')
Out[9]: KstestResult(statistic=0.8413447460685429, pvalue=0.0012672077773713667)
In [10]: stats.kstest([1]*10000, 'norm')
Out[10]: KstestResult(statistic=0.8413447460685429, pvalue=0.0)
As you can see, [1]*10000 is still considered to come from a normal distribution, and [1]*10000 has the same statistic value as [1, 2, 3, 4] but a different p-value. This confuses me.
I think this kind of histogram is a normal distribution:
Did I miss anything? Can you help with this?
The null hypothesis of the Kolmogorov-Smirnov test, as called here with 'norm' and no extra arguments, is that the sample comes from a standard normal distribution. So a p-value near zero rejects normality; it does not confirm it.
from scipy import stats
import random
print(stats.kstest([1] * 1000, 'norm').pvalue) # 0.0
print(stats.kstest([random.gauss(0, 1) for _ in range(1000)], 'norm').pvalue) # 0.7275173462861986
You can see that the constant sample leads to a p-value of zero, strongly suggesting that it is not normal. On the other hand, the normal sample leads to a large p-value, so the test finds no evidence against the sample being from a normal distribution. (The exact p-value changes from run to run because the sample is random.)
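One caveat when reading these p-values: with no extra arguments, 'norm' means the standard normal distribution (mean 0, standard deviation 1), so a sample that is normal but centred or scaled differently gets rejected as well. The sketch below uses made-up data with mean 5 and standard deviation 2 (an assumption for illustration, not the data from the question) and passes the sample mean and standard deviation through the args parameter of stats.kstest. Because those parameters are estimated from the same sample, the second p-value is only approximate and tends to be too large.

```python
import random
import statistics

from scipy import stats

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical data: normal, but NOT standard normal (mean 5, std 2).
data = [random.gauss(5, 2) for _ in range(1000)]

# Tested against N(0, 1): rejected, even though the data are normal.
result_raw = stats.kstest(data, 'norm')

# Tested against a normal with the sample's own mean and std.
# args is forwarded to the 'norm' CDF as (loc, scale).
loc = statistics.mean(data)
scale = statistics.stdev(data)
result_fit = stats.kstest(data, 'norm', args=(loc, scale))

print(result_raw.pvalue)  # essentially zero
print(result_fit.pvalue)  # much larger
```

If the goal is to test "normal with some mean and variance" rather than "standard normal", a test designed for that, such as stats.shapiro, may be a better fit.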
The same reading of the p-value applies to your case. All the suspected samples show p-values near zero, indicating that they are not from a standard normal distribution. So stats.kstest is not broken, in my opinion.
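As for why [1, 2, 3, 4] and [1]*10000 share the same statistic: the KS statistic is the largest vertical gap between the sample's empirical CDF and the reference CDF. In both samples the smallest value is 1, so just below 1 the empirical CDF is 0 while the standard normal CDF is already about 0.8413, and no later gap is larger. The p-value differs because it also depends on the sample size: the same gap is far stronger evidence with 10000 points than with 4. Below is a minimal sketch of that definition using only the standard library; the helper names normal_cdf and ks_statistic are made up for this illustration and are not SciPy's implementation.

```python
import math


def normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample):
    """Two-sided KS statistic of `sample` against the standard normal."""
    xs = sorted(sample)
    n = len(xs)
    # Gap where the empirical CDF (after its step at x) is above the normal CDF.
    d_plus = max((i + 1) / n - normal_cdf(x) for i, x in enumerate(xs))
    # Gap where the normal CDF is above the empirical CDF (before its step at x).
    d_minus = max(normal_cdf(x) - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)


print(normal_cdf(1.0))              # about 0.8413
print(ks_statistic([1, 2, 3, 4]))   # about 0.8413
print(ks_statistic([1] * 10000))    # about 0.8413
```

Both samples hit their largest gap at the first data point, which is why the statistic equals the normal CDF evaluated at 1 in both cases.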