I am trying to compute the KS test specifying the CDF as an array, but I encountered unexpected results. Upon further evaluation, I found different results depending on whether I specified the CDF as a callable, a string, or an array. My code is as follows:
import scipy.stats as st
from scipy.stats import kstest
random_variables = st.norm.rvs(loc=1, scale=1,size=1000000)
cdf_data = st.norm.cdf(random_variables, loc=1,scale=1)
params = st.norm.fit(data=random_variables)
display(params)
print('\n')
#test 1
out = kstest(rvs=random_variables,cdf='norm',args=params)
display(out, out[0], out[1])
print('\n')
#test 2
out = kstest(rvs=random_variables,cdf=st.norm.cdf,args=params)
display(out, out[0], out[1])
print('\n')
#test 3
out = kstest(rvs=random_variables,cdf=cdf_data)
display(out, out[0], out[1])
The results from this code are:
(1.0004825310590526, 0.9996641807017618)
KstestResult(statistic=0.0007348981302804924, pvalue=0.6523439724424506)
0.0007348981302804924
0.6523439724424506
KstestResult(statistic=0.0007348981302804924, pvalue=0.6523439724424506)
0.0007348981302804924
0.6523439724424506
KstestResult(statistic=0.500165, pvalue=0.0)
0.500165
0.0
Given that the large sample is compared against the exact distribution from which it was generated, I expect a failure to reject the null hypothesis. This is the case in tests 1 and 2, but not in test 3. I want to be able to replicate this test using an array for the "cdf" argument. Any help as to what I am doing wrong in test 3 would be very helpful. My numpy version is 1.19.2 and my scipy version is 1.5.2. Thank you!
I think there are two things that may be contributing to your confusion.
First, cdf_data = st.norm.cdf(random_variables, loc=1, scale=1) returns the value of the cumulative distribution function at each x value in random_variables, so it holds probabilities between 0 and 1 rather than draws from the distribution. When the cdf argument is an array, kstest runs a two-sample test and treats that array as a second sample. Your cdf_data and random_variables are therefore two very different distributions, so you should expect a p-value of 0.
Second, I suggest you replace cdf_data with something like random_variables_2 = st.norm.rvs(loc=1, scale=1, size=len(random_variables)), so that the array passed as cdf is just another set of normally distributed data points. You should then find that the two distributions are consistent. This will not necessarily give you the exact same answer as the previous two cases, just a KS test statistic and p-value suggesting that the two data sets come from the same underlying distribution.
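As a rough illustration of the point above, here is a minimal sketch. The sample size, seeds, and variable names are my own choices, not from the question. It shows that passing CDF values as the array reproduces the statistic near 0.5, while passing a second normal sample gives the same result as ks_2samp. The last call is one possible way to test precomputed CDF values, assuming they were computed with the hypothesized distribution: under the null they should be Uniform(0, 1), so a one-sample test against "uniform" should match the one-sample test on the raw data.

```python
import scipy.stats as st

n = 10_000  # smaller than the question's sample so the sketch runs quickly

# Two independent samples from the same normal distribution (seeds are arbitrary).
sample_1 = st.norm.rvs(loc=1, scale=1, size=n, random_state=0)
sample_2 = st.norm.rvs(loc=1, scale=1, size=n, random_state=1)

# CDF values of sample_1: probabilities in [0, 1], not draws from the normal.
cdf_values = st.norm.cdf(sample_1, loc=1, scale=1)

# An array passed as `cdf` is treated as a second sample (two-sample test).
mismatched = st.kstest(sample_1, cdf_values)    # normal sample vs. values in [0, 1]
two_sample = st.kstest(sample_1, sample_2)      # normal sample vs. normal sample
via_ks_2samp = st.ks_2samp(sample_1, sample_2)  # the explicit two-sample function

# One-sample test against the known distribution, for comparison.
one_sample = st.kstest(sample_1, st.norm.cdf, args=(1, 1))

# Assumption: the CDF values come from the hypothesized distribution,
# so they can be tested against Uniform(0, 1) instead.
via_uniform = st.kstest(cdf_values, "uniform")

print("CDF values as second sample:", mismatched)
print("second normal sample:       ", two_sample)
print("ks_2samp directly:          ", via_ks_2samp)
print("one-sample, callable CDF:   ", one_sample)
print("CDF values vs. uniform:     ", via_uniform)
```

The two-sample result will generally differ from the one-sample result because it compares two empirical distributions rather than one sample against an exact CDF.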