I want to check whether a sample comes from a distribution specified by an empirical CDF. kstest gives p-values that I consider to be wrong; what's the matter?
I have written a test function to verify the p-values. I am comparing sample arrays drawn from two identical distributions and checking the p-values obtained from the kstest and ks_2samp functions. Since the null hypothesis is true (the distributions are identical), the p-values should be uniformly distributed on [0, 1]; in other words, I should see a false discovery rate equal to the p-value threshold used. However, this is the case only for the p-values given by ks_2samp.
from scipy import stats
from statsmodels.distributions.empirical_distribution import ECDF

def test():
    num_runs = 1000
    detected_kstest = 0
    detected_ks_2samp = 0
    for _ in range(num_runs):
        data1 = stats.norm.rvs(size=100, loc=1.0, scale=1.0)
        data2 = stats.norm.rvs(size=100, loc=1.0, scale=1.0)
        ecdf = ECDF(data1)
        p_threshold = 0.05
        _, p_val = stats.kstest(data2, ecdf)
        if p_val < p_threshold:
            detected_kstest += 1
        _, p_val = stats.ks_2samp(data1, data2)
        if p_val < p_threshold:
            detected_ks_2samp += 1
    print(f'FDR for p-value threshold {p_threshold} : kstest: {detected_kstest / num_runs}, ks_2samp: {detected_ks_2samp / num_runs}')
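(Not part of the original snippet: seeding NumPy's global random state before calling test() makes the counts reproducible; the seed value is arbitrary.)

import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the run reproducible
test()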
The output is
FDR for p-value threshold 0.05 : kstest: 0.287, ks_2samp: 0.051
I expect both FDR values to be near 0.05, but the value given by kstest is weirdly high; in other words, kstest often insists that the data come from different distributions. Am I missing something?
UPDATE
As answered below, the reason is that kstest doesn't handle an ECDF generated from a small sample well enough. Alas, I have to generate my empirical CDFs from samples that are also not very big. For now, as a quick workaround, I use a 'hybrid' method:
import numpy as np
from scipy import stats

def my_ks_test(data, ecdf, ecdf_n=None):
    n = data.size
    sorted_data = np.sort(data)
    # Empirical CDF of the tested sample, evaluated at its own sorted points
    data_cdf = np.searchsorted(sorted_data, sorted_data, side='right') / (1.0 * n)
    # Reference ECDF evaluated at the same points
    data_cdf_by_ecdf = ecdf(sorted_data)
    # KS statistic: maximum absolute difference between the two CDFs
    d = np.max(np.absolute(data_cdf - data_cdf_by_ecdf))
    if ecdf_n is None:
        en = np.sqrt(n)                               # one-sample effective size
    else:
        en = np.sqrt(n * ecdf_n / float(n + ecdf_n))  # two-sample effective size
    try:
        # Asymptotic Kolmogorov distribution with the usual small-sample correction
        p_val = stats.distributions.kstwobign.sf((en + 0.12 + 0.11 / en) * d)
    except Exception:
        p_val = 1.0
    return p_val
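As a usage sketch (not part of the original code, and assuming my_ks_test as defined above), the call in the test loop would pass the size of the sample that produced the ECDF:

from scipy import stats
from statsmodels.distributions.empirical_distribution import ECDF

data1 = stats.norm.rvs(size=100, loc=1.0, scale=1.0)
data2 = stats.norm.rvs(size=100, loc=1.0, scale=1.0)
ecdf = ECDF(data1)
# Pass the size of the sample behind the ECDF so that the two-sample
# effective size is used instead of sqrt(n)
p_val = my_ks_test(data2, ecdf, ecdf_n=data1.size)
print(p_val)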
So it can take as an argument the number of samples that was used to generate the ECDF. Maybe that's not exactly mathematically rigorous, but for now it's the best I can come up with. When tested on data1 and data2, both of size 100, it gives
FDR for p-value threshold 0.05 : kstest: 0.268, ks_2samp: 0.049, my_ks_test: 0.037
The ECDF that you compute approximates a normal distribution, but if you test a large enough sample from an actual normal distribution with that ECDF, kstest will detect that the sample is not from the ECDF. After all, the ECDF is not a normal distribution.

Apparently a sample size of 100 (from an actual normal distribution) is large enough that kstest often detects that those samples are not from the distribution associated with the ECDF that is based on data1.
If you increase the size of data1 while keeping the size of data2 fixed, you will eventually get the result that you expect. By increasing the size of data1, you increase how well the ECDF approximates an actual normal distribution.
When I change the creation of data1 to

data1 = stats.norm.rvs(size=5000, loc=1.0, scale=1.0)

here's what I get:
In [121]: test()
FDR for p-value threshold 0.05 : kstest: 0.048, ks_2samp: 0.0465
In [122]: test()
FDR for p-value threshold 0.05 : kstest: 0.0515, ks_2samp: 0.0475
In [123]: test()
FDR for p-value threshold 0.05 : kstest: 0.0515, ks_2samp: 0.05
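A rough way to see why the larger reference sample helps (this check is mine, not part of the answer's code): the maximum distance between the ECDF and the true normal CDF shrinks roughly like 1/sqrt(n), so with 5000 points the ECDF is close enough to the true CDF that kstest behaves almost as if it were given the exact distribution.

import numpy as np
from scipy import stats
from statsmodels.distributions.empirical_distribution import ECDF

grid = np.linspace(-3.0, 5.0, 2001)  # covers most of N(loc=1, scale=1)
for n in (100, 5000):
    ecdf = ECDF(stats.norm.rvs(size=n, loc=1.0, scale=1.0))
    # Sup-norm distance between the ECDF and the true CDF it approximates
    err = np.max(np.abs(ecdf(grid) - stats.norm.cdf(grid, loc=1.0, scale=1.0)))
    print(f'n={n}: max |ECDF - CDF| = {err:.3f}')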