
Obtaining the critical values needed for the Kolmogorov-Smirnov test


I'm talking about retrieving the values in this table via a Python formula:

https://www.soest.hawaii.edu/GG/FACULTY/ITO/GG413/K_S_Table_one_Sample.pdf

I've been looking for a while, but no scipy function seems to return this value, and to be honest I'm getting pretty confused.

I've looked through scipy's built-in functions without success. For example, in the table linked above, D[0.1, 10] == 0.36866. Yet scipy.stats.kstest does NOT return this value, no matter how much I play with my data.


Solution

  • This can be done with scipy, using the ksone distribution and its ppf (percent point function, i.e. the inverse CDF) method, rather than kstest:

    from scipy.stats import ksone
    
    def ks_critical_value(n_trials, alpha):
        # Critical value of D for n_trials observations at significance level alpha
        return ksone.ppf(1 - alpha / 2, n_trials)
    

    Printing a table of critical values:

    from __future__ import print_function # For Python 2
    
    trials = range(1, 41)
    alphas = [0.1, 0.05, 0.02, 0.01]
    
    # Print table headers
    print('{:<6}|{:<6} Level of significance, alpha'.format(' ', ' '))
    print('{:<6}|{:>8} {:>8} {:>8} {:>8}'.format(*['Trials'] + alphas))
    print('-' * 42)
    # Print critical values for each n_trials x alpha combination
    for t in trials:
        print('{:6d}|{:>8.5f} {:>8.5f} {:>8.5f} {:>8.5f}'
              .format(*[t] + [ks_critical_value(t, a) for a in alphas]))
        if t % 10 == 0:
            print()
    

    Partial output:

          |       Level of significance, alpha
    Trials|     0.1     0.05     0.02     0.01
    ------------------------------------------
         1|     nan      nan      nan      nan
         2| 0.77639  0.84189      nan      nan
         3| 0.63604  0.70760  0.78456  0.82900
         4| 0.56522  0.62394  0.68887  0.73424
         5| 0.50945  0.56328  0.62718  0.66853
         6| 0.46799  0.51926  0.57741  0.61661
         7| 0.43607  0.48342  0.53844  0.57581
         8| 0.40962  0.45427  0.50654  0.54179
         9| 0.38746  0.43001  0.47960  0.51332
        10| 0.36866  0.40925  0.45662  0.48893
    
        11| 0.35242  0.39122  0.43670  0.46770
        12| 0.33815  0.37543  0.41918  0.44905
        13| 0.32549  0.36143  0.40362  0.43247
        14| 0.31417  0.34890  0.38970  0.41762
        15| 0.30397  0.33760  0.37713  0.40420
        16| 0.29472  0.32733  0.36571  0.39201
        17| 0.28627  0.31796  0.35528  0.38086
        18| 0.27851  0.30936  0.34569  0.37062
        19| 0.27136  0.30143  0.33685  0.36117
        20| 0.26473  0.29408  0.32866  0.35241
    

    I would welcome feedback from a statistician on (a) why we get np.nan values in the top two rows (I assume because the critical values for these combinations of n_trials and alpha are purely theoretical and not achievable in practice), and (b) why the ksone.ppf method needs alpha to be divided by 2. I will edit this answer to include that information.

    You can see, though, that apart from the initial missing values, this code reproduces the table in your question, as well as the table on page 16 of this paper.
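
On the alpha / 2 question, one likely explanation is that ksone models the one-sided statistic, whereas the linked table is for the two-sided test, so each tail receives alpha / 2. The sketch below is not part of the original answer. It assumes a SciPy version that ships scipy.stats.kstwo (the two-sided distribution), and the helper name ks_critical_value_two_sided and the sample data are made up for illustration. It compares the two approaches and shows how a critical value relates to what kstest returns. Newer SciPy releases may also return finite values where the table above shows nan, but that is not checked here.

```python
from scipy.stats import ksone, kstwo, kstest

def ks_critical_value(n_trials, alpha):
    # Approach from the answer: one-sided distribution evaluated at alpha / 2
    return ksone.ppf(1 - alpha / 2, n_trials)

def ks_critical_value_two_sided(n_trials, alpha):
    # Hypothetical alternative: two-sided distribution, no division by 2
    return kstwo.ppf(1 - alpha, n_trials)

n, alpha = 10, 0.1
d_one = ks_critical_value(n, alpha)
d_two = ks_critical_value_two_sided(n, alpha)
print('ksone: {:.5f}  kstwo: {:.5f}'.format(d_one, d_two))

# Made-up sample of 10 values, tested against the uniform distribution on [0, 1]
sample = [0.05, 0.12, 0.31, 0.38, 0.44, 0.52, 0.61, 0.77, 0.83, 0.95]
result = kstest(sample, 'uniform')

# kstest reports the observed statistic D and a p-value, not the critical value;
# the null hypothesis is rejected at level alpha when D exceeds the critical value.
reject = result.statistic > d_two
print('D = {:.5f}, p = {:.5f}, reject: {}'.format(
    result.statistic, result.pvalue, reject))
```

For n = 10 and alpha = 0.1, both functions should land close to the 0.36866 quoted in the question. This also suggests why kstest never shows that number: it returns the statistic observed in your data, which you then compare against the critical value (or, equivalently, you compare the p-value against alpha).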