Tags: python, scipy, statistics, kolmogorov-smirnov

Issues with using parameters for a K-S test and understanding the result


I'm trying to run a K-S test on some data. The code runs, but I'm not sure I understand what is going on, and I also get an error when trying to set loc. I get both the K-S statistic and the p-value, but I don't grasp them well enough to use the result.

I'm using the scipy.stats.ks_2samp function.

This is the code I am running

import numpy as np
from scipy import stats

np.random.seed(12345678)  #fix random seed to get the same result
n1 = len(low_ni_sample)  # size of first sample
n2 = len(high_ni_sample)  # size of second sample

# Scale is standard deviation
scale = 3

rvs1 = stats.norm.rvs(low_ni_sample[:,0], size=n1, scale=scale)
rvs2 = stats.norm.rvs(high_ni_sample[:,0], size=n2, scale=scale)
ksresult = stats.ks_2samp(rvs1, rvs2)
ks_val = ksresult[0]
p_val = ksresult[1]

print('K-S Statistics ' + str(ks_val))
print('P-value ' + str(p_val))

Which gives this:

K-S Statistics 0.04507948306145837
P-value 0.8362207851676332

In the examples I've seen, loc is passed like this:

rvs1 = stats.norm.rvs(low_ni_sample[:,0], size=n1, loc=0., scale=scale)
rvs2 = stats.norm.rvs(high_ni_sample[:,0], size=n2, loc=0.5, scale=scale)

If I do that however, I get this error:

Traceback (most recent call last):

  File "<ipython-input-342-aa890a947919>", line 13, in <module>
    rvs1 = stats.norm.rvs(low_ni_sample[:,0], size=n1, loc=0., scale=scale)

  File "/home/kongstad/anaconda3/envs/tensorflow/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py", line 937, in rvs
    args, loc, scale, size = self._parse_args_rvs(*args, **kwds)

TypeError: _parse_args_rvs() got multiple values for argument 'loc'

Here is a snapshot showing the content of the two datasets being used, low_ni_sample and high_ni_sample.

So my questions are:

  1. Why can't I add a loc value, and what does it represent?
  2. Changing the scale changes the result significantly. Why, and which value should I go by?
  3. How would I plot this in a way that makes sense?

After running Silma's suggestion I stumbled upon a new error.

import numpy as np
from scipy import stats

np.random.seed(12345678)  #fix random seed to get the same result
n1 = len(low_ni_sample)  # size of first sample
n2 = len(high_ni_sample)  # size of second sample

# Scale is standard deviation
scale = 3

ndist = stats.norm(loc=0., scale=scale)

rvs1 = ndist.rvs(low_ni_sample[:,0],size=n1)
rvs2 = ndist.rvs(high_ni_sample[:,0],size=n2)

#rvs1 = stats.norm.rvs(low_ni_sample[:,2], size=n1, scale=scale)
#rvs2 = stats.norm.rvs(high_ni_sample[:,2], size=n2, scale=scale)
ksresult = stats.ks_2samp(rvs1, rvs2)
ks_val = ksresult[0]
p_val = ksresult[1]

print('K-S Statistics ' + str(ks_val))
print('P-value ' + str(p_val))

This gives the following error message:

    rvs1 = ndist.rvs(low_ni_sample[:,0],size=n1)

TypeError: rvs() got multiple values for argument 'size'

Solution

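  • The error comes from how rvs reads its arguments: the first positional argument of stats.norm.rvs is loc, so passing your data array positionally and loc=0. as a keyword supplies loc twice. To draw samples from a normal distribution, first create an instance of it with the parameters you want: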

    ndist = stats.norm(loc=0., scale=scale)
    

    then do

    rvs1 = ndist.rvs(size=n1)
    

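    to generate n1 samples drawn from a normal distribution centered on 0 with standard deviation scale. For the normal distribution, loc is therefore the mean. Note that rvs on the instance takes no data argument: its first positional argument is size, which is why ndist.rvs(low_ni_sample[:,0], size=n1) raises the second TypeError.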
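Both TypeErrors from the question can be reproduced in a few lines. This is only a sketch: `data` is a made-up stand-in for low_ni_sample[:,0], and the exact wording of the error messages may differ between SciPy versions.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for low_ni_sample[:, 0]
data = np.array([1.0, 2.0, 3.0])

# stats.norm.rvs takes loc as its first positional argument, so the
# data array is read as loc and the keyword loc is a duplicate.
try:
    stats.norm.rvs(data, size=data.size, loc=0.0, scale=3)
    loc_error = None
except TypeError as exc:
    loc_error = str(exc)

# On a frozen distribution the first positional argument of rvs is size,
# so the data array collides with the keyword size instead.
ndist = stats.norm(loc=0.0, scale=3)
try:
    ndist.rvs(data, size=data.size)
    size_error = None
except TypeError as exc:
    size_error = str(exc)

# Correct call: only say how many samples to draw.
samples = ndist.rvs(size=data.size, random_state=12345678)

print(loc_error)
print(size_error)
print(samples.shape)
```

In both failing calls the data array lands in a parameter slot (loc or size) that is then also given by keyword.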

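    Changing the scale changes the standard deviation of the distribution, so the samples are more spread out, and this changes the K-S statistic. In your original call the data array was taken as loc, so each draw was a data point plus Gaussian noise of width scale; the larger the noise, the more it masks any difference between the two samples.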
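To see the effect of scale, here is a rough sketch that mimics the original call (data passed as loc) for a few noise widths. The arrays `low` and `high` are synthetic stand-ins, not the real data, and the helper `ks_after_noise` is a name made up for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-ins for low_ni_sample[:, 0] and high_ni_sample[:, 0]
low = rng.normal(loc=0.0, scale=1.0, size=500)
high = rng.normal(loc=1.0, scale=1.0, size=500)


def ks_after_noise(scale):
    """Run the K-S test after adding Gaussian noise of width `scale`."""
    # Passing the data as loc draws one value around each data point.
    noisy_low = stats.norm.rvs(loc=low, scale=scale, size=low.size,
                               random_state=rng)
    noisy_high = stats.norm.rvs(loc=high, scale=scale, size=high.size,
                                random_state=rng)
    return stats.ks_2samp(noisy_low, noisy_high)


results = {scale: ks_after_noise(scale) for scale in (0.1, 3.0, 30.0)}
for scale, res in results.items():
    print(f"scale={scale:5.1f}  statistic={res.statistic:.3f}  "
          f"p-value={res.pvalue:.3g}")
```

With small noise the shift between the two stand-in samples is still visible to the test; with large noise both samples look like the same wide Gaussian, so the statistic shrinks.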

    As for the plot, I'm not sure I see what you mean... if you want to plot the histograms, then do

    import matplotlib.pyplot as plt
    plt.hist(rvs1)
    plt.show()
    

    Or even better, install seaborn and use its distribution plots, for instance a KDE (distplot in older releases; histplot, displot or kdeplot since seaborn 0.11).
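Another plot that may fit the K-S test better than a histogram is the pair of empirical CDFs, since the two-sample K-S statistic is the largest vertical gap between them. A minimal sketch, assuming synthetic stand-in arrays; `ecdf` and `ks_distance` are helper names invented here, not SciPy functions.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical stand-ins for low_ni_sample[:, 0] and high_ni_sample[:, 0]
low = rng.normal(loc=0.0, scale=1.0, size=300)
high = rng.normal(loc=0.5, scale=1.0, size=300)


def ecdf(sample):
    """Return sorted values and the fraction of points <= each value."""
    x = np.sort(sample)
    y = np.arange(1, x.size + 1) / x.size
    return x, y


def ks_distance(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    pooled = np.concatenate([a, b])
    cdf_a = np.searchsorted(np.sort(a), pooled, side="right") / a.size
    cdf_b = np.searchsorted(np.sort(b), pooled, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))


fig, ax = plt.subplots()
for sample, label in ((low, "low"), (high, "high")):
    x, y = ecdf(sample)
    ax.step(x, y, where="post", label=label)
ax.set_xlabel("value")
ax.set_ylabel("empirical CDF")
ax.set_title(f"K-S distance = {ks_distance(low, high):.3f}")
ax.legend()
# plt.show()  # uncomment to display the figure when running as a script
```

The value in the title should agree with the statistic returned by stats.ks_2samp(low, high).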

    Overall I would advise reading a little more on distributions and K-S tests before going any further; see for instance the Wikipedia page.

    EDIT: The code shown above generates random samples from a normal distribution (which I assumed was your goal, to compare with your samples).

    If what you want to do is directly compare your two sample data, then all you need is

    ksresult = stats.ks_2samp(low_ni_sample[:,0], high_ni_sample[:,0])
    

    Again, this assumes that low_ni_sample[:,0] and high_ni_sample[:,0] are 1D arrays containing many measurements of the quantity of interest; cf. the ks_2samp documentation.
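As for reading the result, here is a hedged sketch of the direct comparison. The null hypothesis of the two-sample K-S test is that both samples come from the same distribution, and a small p-value is evidence against it. The arrays `low_demo` and `high_demo` are synthetic stand-ins, and the 0.05 threshold is a common convention assumed here, not something taken from the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical 2-D arrays standing in for low_ni_sample and high_ni_sample
low_demo = rng.normal(loc=0.0, scale=1.0, size=(300, 3))
high_demo = rng.normal(loc=1.0, scale=1.0, size=(300, 3))

# Compare the first column of each array directly, with no rvs call.
res = stats.ks_2samp(low_demo[:, 0], high_demo[:, 0])

alpha = 0.05  # assumed significance level
if res.pvalue < alpha:
    verdict = "reject"
else:
    verdict = "cannot reject"

print(f"statistic={res.statistic:.3f}  p-value={res.pvalue:.3g}")
print(f"At alpha={alpha}: {verdict} the hypothesis of a common distribution")
```

By the same reading, the p-value of about 0.836 printed in the question would not reject a common distribution at the 5% level. That result describes the noise-blurred samples, not the raw data.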