python python-3.x scipy normal-distribution scipy.stats

Python Kolmogorov-Smirnov (KS) Test Inconsistent Results

I am trying to compute the KS test with the CDF specified as an array, but I got unexpected results. On closer inspection, I found that the results differ depending on whether I specify the CDF as a callable, a string, or an array. My code is as follows:

import scipy.stats as st
from scipy.stats import kstest  # kstest is called unqualified below

random_variables = st.norm.rvs(loc=1, scale=1,size=1000000)
cdf_data = st.norm.cdf(random_variables, loc=1,scale=1)
params = st.norm.fit(data=random_variables)
display(params)
print('\n')

#test 1
out = kstest(rvs=random_variables,cdf='norm',args=params)
display(out, out[0], out[1])
print('\n')

#test 2
out = kstest(rvs=random_variables,cdf=st.norm.cdf,args=params)
display(out, out[0], out[1])
print('\n')

#test 3
out = kstest(rvs=random_variables,cdf=cdf_data)
display(out, out[0], out[1])

The results from this code are:

(1.0004825310590526, 0.9996641807017618)


KstestResult(statistic=0.0007348981302804924, pvalue=0.6523439724424506)
0.0007348981302804924
0.6523439724424506


KstestResult(statistic=0.0007348981302804924, pvalue=0.6523439724424506)
0.0007348981302804924
0.6523439724424506


KstestResult(statistic=0.500165, pvalue=0.0)
0.500165
0.0

Given that the large sample is compared against the exact distribution from which it was generated, I expect a failure to reject the null hypothesis. This is the case in tests 1 and 2, but not in test 3. I want to be able to replicate this test using an array for the "cdf" argument. Any help as to what I am doing wrong in test 3 would be much appreciated. My numpy version is 1.19.2 and my scipy version is 1.5.2. Thank you!

Solution

I think there are two things that may be contributing to your confusion.

First, I don't think you want to compare against cdf_data = st.norm.cdf(random_variables, loc=1, scale=1). That call returns the value of the cumulative distribution function at each x value in random_variables, so cdf_data is a set of numbers between 0 and 1 rather than a sample from the normal distribution. When kstest receives an array for cdf, it treats that array as a second sample, so you are comparing two distributions. Your cdf_data and random_variables follow two very different distributions, so you should expect a p-value of 0. I suggest replacing cdf_data with something like random_variables_2 = st.norm.rvs(loc=1, scale=1, size=1000000).

Second, you are performing two different KS tests: the first two are one-sample tests and the third is a two-sample test. In the first two you compare your data to a fixed functional form to check whether the data is consistent with that distribution. Since the data and the distribution are the same in tests 1 and 2, you should expect identical output. In test 3, however, you are comparing two independent samples to see whether they are consistent with each other. Once cdf_data is replaced by other normally distributed data points, you should find that the two samples are consistent, but the result will not necessarily match the previous two tests exactly. You should only expect a KS statistic and p-value that suggest the two data sets come from the same underlying distribution.
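To make the difference concrete, here is a minimal sketch of the cases discussed above. The variable names, the seed, and the sample size of 20000 are my own choices for illustration, not taken from the question. It assumes a scipy version (such as 1.5.x) in which kstest runs a two-sample test when cdf is an array.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 20000                       # smaller than the question's sample, to keep it quick

sample_1 = rng.normal(loc=1, scale=1, size=n)
sample_2 = rng.normal(loc=1, scale=1, size=n)  # independent second sample

# One-sample test: data against the N(1, 1) CDF given by name.
one_sample = st.kstest(sample_1, "norm", args=(1, 1))

# Two-sample test: an array passed as `cdf` is treated as a second sample,
# so this should agree with calling ks_2samp directly.
two_sample = st.kstest(sample_1, sample_2)
direct_two_sample = st.ks_2samp(sample_1, sample_2)

# What test 3 in the question does: the CDF values lie in [0, 1], so they
# are compared as a sample against normally distributed data.
cdf_values = st.norm.cdf(sample_1, loc=1, scale=1)
mismatch = st.kstest(sample_1, cdf_values)

# The CDF values themselves look uniform on [0, 1], not normal.
cdf_vs_uniform = st.kstest(cdf_values, "uniform")

for label, result in [
    ("one-sample vs norm", one_sample),
    ("two-sample (array)", two_sample),
    ("ks_2samp directly", direct_two_sample),
    ("data vs its CDF values", mismatch),
    ("CDF values vs uniform", cdf_vs_uniform),
]:
    print(label, result.statistic, result.pvalue)
```

The "data vs its CDF values" line should show a statistic close to 0.5 and a p-value of essentially 0, which matches the output of test 3 in the question. The other comparisons should give small statistics.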