Search code examples
rcdf

Manually calculate two-sample kolmogorov-smirnov using ECDF


I am trying to manually compute the KS statistic for two random samples. As far as I understood the KS statistic D is the maximum vertical deviation between the two CDFs. However, manually calculating the differences between the two CDF and running the ks.test from the base R yields different results. I wonder where is the mistake.

set.seed(123)
a <- rnorm(10000)
b <- rnorm(10000)

### Manual calculation 
# function for calculating manually the ecdf
decdf <- function(x, baseline, treatment)  ecdf(baseline)(x) - ecdf(treatment)(x)
#Difference between the two CDFs 
d <- curve(decdf(x,a,b), from=min(a,b), to=max(a,b))
# getting D 
ks <- max(abs(d$y))

#### R-Base calculation 

ks.test(a,b)

The R-Base D = 0.0109 while the manual calculation is 0.0088. Any help explaining the difference is appreciated.

I attach the R-Base source code ( a bit cleaned up)


n <- length(a)                                                                          
n.x <- as.double(n)
n.y <- length(b)
n <- n.x * n.y/(n.x + n.y)
w <- c(a, b)
z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
STATISTIC <- max(abs(z))



Solution

  • By default, curve evaluates the function on a subdivision of 100 points between from and to. By restricting to these 100 points, it's possible that you miss the value for which the maximum difference is attained.

    Instead, evaluate the difference at all points where the ecdf's jump and you are sure to catch the value for which the maximum difference is attained.

    set.seed(123)
    a <- rnorm(10000)
    b <- rnorm(10000)
    Fa <- ecdf(a)
    Fb <- ecdf(b)
    
    x <- c(a,b) # the points where Fa or Fb jump
    
    max(abs(Fa(x) - Fb(x)))
    # [1] 0.0109