Tags: python, scipy, kolmogorov-smirnov

kstest gives strange p-values


I want to check whether a sample comes from a distribution specified by an empirical CDF. kstest gives p-values that I consider to be wrong; what's the matter?

I have written a test function to verify the p-values. I compare sample arrays drawn from two identical distributions and check the p-values obtained from the kstest and ks_2samp functions. Since the null hypothesis is true (the distributions are identical), the p-values should be uniformly distributed on [0, 1]; in other words, the false discovery rate should equal the p-value threshold used. However, this holds only for the p-values given by ks_2samp.

from scipy import stats
from statsmodels.distributions.empirical_distribution import ECDF

def test():
    num_runs = 1000
    p_threshold = 0.05
    detected_kstest = 0
    detected_ks_2samp = 0

    for _ in range(num_runs):
        # Two independent samples from the same normal distribution,
        # so the null hypothesis of the KS test is true.
        data1 = stats.norm.rvs(size=100, loc=1.0, scale=1.0)
        data2 = stats.norm.rvs(size=100, loc=1.0, scale=1.0)

        ecdf = ECDF(data1)

        # One-sample test: data2 against the ECDF built from data1.
        _, p_val = stats.kstest(data2, ecdf)
        if p_val < p_threshold:
            detected_kstest += 1

        # Two-sample test: data1 against data2 directly.
        _, p_val = stats.ks_2samp(data1, data2)
        if p_val < p_threshold:
            detected_ks_2samp += 1

    print(f'FDR for p-value threshold {p_threshold} : kstest: {detected_kstest / num_runs}, ks_2samp: {detected_ks_2samp / num_runs}')

The output is

FDR for p-value threshold 0.05 : kstest: 0.287, ks_2samp: 0.051

I expect both FDR values to be near 0.05, but the value given by kstest is far too high; in other words, kstest often insists that the data come from different distributions.

Am I missing something?

UPDATE

As answered below, the reason is that kstest doesn't handle an ECDF generated from a small sample well enough. Alas, I have to generate empirical CDFs from samples that are also not very big. For now, as a quick workaround, I use a 'hybrid' method:

import numpy as np
from scipy import stats

def my_ks_test(data, ecdf, ecdf_n=None):
    n = data.size
    sorted_data = np.sort(data)
    # Empirical CDF of `data`, evaluated at its own sorted points.
    data_cdf = np.searchsorted(sorted_data, sorted_data, side='right') / (1.0 * n)
    # Reference ECDF evaluated at the same points.
    data_cdf_by_ecdf = ecdf(sorted_data)

    # KS statistic: largest vertical distance between the two CDFs.
    d = np.max(np.absolute(data_cdf - data_cdf_by_ecdf))

    if ecdf_n is None:
        # One-sample case: the reference CDF is treated as exact.
        en = np.sqrt(n)
    else:
        # Two-sample case: effective sample size n * m / (n + m).
        en = np.sqrt(n * ecdf_n / float(n + ecdf_n))

    try:
        # Asymptotic Kolmogorov distribution with the usual finite-sample correction.
        p_val = stats.distributions.kstwobign.sf((en + 0.12 + 0.11 / en) * d)
    except Exception:
        p_val = 1.0

    return p_val

So it can take as an argument the number of samples that was used to generate the ECDF. Maybe that's not exactly mathematically rigorous, but for now it's the best I can come up with. When tested on data1 and data2 both of size 100, it gives

FDR for p-value threshold 0.05 : kstest: 0.268, ks_2samp: 0.049, my_ks_test: 0.037
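For reference, here is a minimal sketch of how the workaround slots into the test loop above (hypothetical usage; detected_my_ks_test is an extra counter alongside the other two):

    p_val = my_ks_test(data2, ecdf, ecdf_n=data1.size)  # pass the size of the sample behind the ECDF
    if p_val < p_threshold:
        detected_my_ks_test += 1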


Solution

  • The ECDF that you compute approximates a normal distribution, but if you test a large enough sample from an actual normal distribution with that ECDF, kstest will detect that the sample is not from the ECDF. After all, the ECDF is not a normal distribution.

    Apparently a sample size of 100 (from an actual normal distribution) is large enough that kstest often detects that those samples are not from the distribution associated with the ECDF that is based on data1.

    If you increase the size of data1 while keeping the size of data2 fixed, you will eventually get the result that you expect. By increasing the size of data1, you increase how well the ECDF approximates an actual normal distribution.

    When I change the creation of data1 to

            data1 = stats.norm.rvs(size=5000, loc=1.0, scale=1.0)
    

    here's what I get:

    In [121]: test()                                                                                     
    FDR for p-value threshold 0.05 : kstest: 0.048, ks_2samp: 0.0465
    
    In [122]: test()                                                                                     
    FDR for p-value threshold 0.05 : kstest: 0.0515, ks_2samp: 0.0475
    
    In [123]: test()                                                                                     
    FDR for p-value threshold 0.05 : kstest: 0.0515, ks_2samp: 0.05
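
    Put differently, the two tests assume different effective sample sizes: kstest treats the supplied CDF as exact (so only the tested sample's size n enters the statistic), while ks_2samp accounts for noise in both samples via the effective size n*m/(n + m). A minimal sketch of that difference, mirroring the en computation in my_ks_test above:

        import numpy as np

        n, m = 100, 100
        en_one = np.sqrt(n)                # what a one-sample test effectively assumes
        en_two = np.sqrt(n * m / (n + m))  # what a two-sample test effectively uses
        print(en_one, en_two)              # 10.0 vs ~7.07: the same D is less significant here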