I'm trying to run a K-S test on some data. Now I have the code working, but I'm not sure I understaned whats going on, and I also get an error when trying to set the loc. Essentially I get both the KS and P-test value. But I'm not sure I fully grasp it, enough to use the result.
I'm using the scipy.stats.ks_2samp module found here.
This is the code I am running
from scipy import stats
np.random.seed(12345678) #fix random seed to get the same result
n1 = len(low_ni_sample) # size of first sample
n2 = len(high_ni_sample) # size of second sample
# Scale is standard deviation
scale = 3
rvs1 = stats.norm.rvs(low_ni_sample[:,0], size=n1, scale=scale)
rvs2 = stats.norm.rvs(high_ni_sample[:,0], size=n2, scale=scale)
ksresult = stats.ks_2samp(rvs1, rvs2)
ks_val = ksresult[0]
p_val = ksresult[1]
print('K-S Statistics ' + str(ks_val))
print('P-value ' + str(p_val))
Which gives this:
K-S Statistics 0.04507948306145837
P-value 0.8362207851676332
Now for those examples I've seen, the loc is added in as this:
rvs1 = stats.norm.rvs(low_ni_sample[:,0], size=n1, loc=0., scale=scale)
rvs2 = stats.norm.rvs(high_ni_sample[:,0], size=n2, loc=0.5, scale=scale)
If I do that however, I get this error:
Traceback (most recent call last):
File "<ipython-input-342-aa890a947919>", line 13, in <module>
rvs1 = stats.norm.rvs(low_ni_sample[:,0], size=n1, loc=0., scale=scale)
File "/home/kongstad/anaconda3/envs/tensorflow/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py", line 937, in rvs
args, loc, scale, size = self._parse_args_rvs(*args, **kwds)
TypeError: _parse_args_rvs() got multiple values for argument 'loc'
Here is a snapshot, showing the content of the two datasets being used.
low_ni_sample, high_ni_sample.
So my questions are:
After running Silma's suggestion I stumbled upon a new error.
from scipy import stats
np.random.seed(12345678) #fix random seed to get the same result
n1 = len(low_ni_sample) # size of first sample
n2 = len(high_ni_sample) # size of second sample
# Scale is standard deviation
scale = 3
ndist = stats.norm(loc=0., scale=scale)
rvs1 = ndist.rvs(low_ni_sample[:,0],size=n1)
rvs2 = ndist.rvs(high_ni_sample[:,0],size=n2)
#rvs1 = stats.norm.rvs(low_ni_sample[:,2], size=n1, scale=scale)
#rvs2 = stats.norm.rvs(high_ni_sample[:,2], size=n2, scale=scale)
ksresult = stats.ks_2samp(rvs1, rvs2)
ks_val = ksresult[0]
p_val = ksresult[1]
print('K-S Statistics ' + str(ks_val))
print('P-value ' + str(p_val))
With this error message
rvs1 = ndist.rvs(low_ni_sample[:,0],size=n1)
TypeError: rvs() got multiple values for argument 'size'
The error comes from the fact that you should first create an instance of the normal distribution before using it:
ndist = stats.norm(loc=0., scale=scale)
then do
rvs1 = ndist.rvs(size=n1)
to generate n1
samples drawn from a normal distribution centered on 0 and with a standard deviation scale
.
The location is therefore the mean of your distribution.
Changing the scale changes the variance of your distribution (you get more variability), so this obviously impacts the KS test...
As for the plot, I'm not sure I see what you mean... if you want to plot the histograms, then do
import matplotlib.pyplot as plt
plt.hist(rvs1)
plt.show()
Or even better, install seaborn and use their distplot
methods, for instance the KDE.
Overall I would advise you to try to read a little more on distributions and KS tests before you go any further, see for instance the wikipedia page.
EDIT the code shown above is used to generate random samples from a standard distribution (which I assumed was your goal, to compare with your samples).
If what you want to do is directly compare your two sample data, then all you need is
ksresult = stats.ks_2samp(low_ni_sample[:,0], high_ni_sample[:,0])
again, this is assuming that low_ni_sample[:,0]
and high_ni_sample[:,0]
are 1D-arrays containing many measurements of the quantity of interest, cf. ks_2samp documentation