I want to compare a reference distribution d_1 with a sample d_2, drawn with probability proportional to size w_1, using the Kolmogorov–Smirnov distance. Because d_2 is weighted, I was considering accounting for this with the weighted empirical cumulative distribution function in R (ewcdf {spatstat}).
The example below shows that I am probably misspecifying the weights, because when length(d_1) == length(d_2) the Kolmogorov–Smirnov distance is not 0.
Can someone help me with this? For clarity, see the reproducible example below.
    library(spatstat)  # provides ewcdf()

    # loop over sample sizes 1:length(d_1)
    d_stat <- data.frame(1:1000, rep(NA, 1000))
    names(d_stat) <- c("sample_size", "ks_distance")
    for (i in 1:1000) {
      # reference distribution and size-proportional weights
      d_1 <- rpois(1000, 500)
      w_1 <- d_1 / sum(d_1)
      m_1 <- data.frame(d_1, w_1)
      # sample of size i from the reference distribution, without replacement
      m_2 <- m_1[sample(nrow(m_1), size = i, prob = w_1, replace = FALSE), ]
      d_2 <- m_2$d_1
      w_2 <- m_2$w_1
      # ewcdf for the reference distribution and the sample
      f_d_1 <- ewcdf(d_1)
      f_d_2 <- ewcdf(d_2, 1 / w_2, normalise = FALSE, adjust = 1 / length(d_2))
      # Kolmogorov–Smirnov distance, evaluated at the sampled values
      d_stat[i, 2] <- max(abs(f_d_1(d_2) - f_d_2(d_2)))
    }
    # distance when the sample is as large as the reference population
    d_stat[1000, 2]
Your code generates some data d_1 and associated numeric weights w_1. These data are then treated as a reference population. The code takes a random sample d_2 from this population of values d_1, with sampling probabilities proportional to the associated weights w_1. From the sample, you compute the weighted empirical distribution function f_d_2 of the sampled values d_2, with weights inversely proportional to the original sampling probabilities. This function f_d_2 is a correct estimate of the original population distribution function, by the Horvitz-Thompson principle. But it is not exactly equal to the original population distribution function, because it is computed from a sample. The Kolmogorov-Smirnov test statistic should not be zero; it should be a small value.
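
To illustrate the point, here is a minimal base-R sketch rather than a definitive implementation. It makes several assumptions that are not in your code: the helper weighted_ecdf is hypothetical and hand-written (it is not the spatstat function), the sample is drawn with replacement so that each draw selects unit k with probability exactly w_1[k], the weights are normalised to sum to 1, and the sample size of 200 and the seed are arbitrary.

```r
set.seed(1)

# Hypothetical helper (not part of spatstat): weighted ECDF of x.
# Weights are normalised to sum to 1, so the function reaches 1 at max(x).
weighted_ecdf <- function(x, w) {
  w <- w / sum(w)
  function(q) vapply(q, function(t) sum(w[x <= t]), numeric(1))
}

# Reference population and size-proportional selection probabilities
d_1 <- rpois(1000, 500)
w_1 <- d_1 / sum(d_1)

# Assumption: draw WITH replacement, so each draw picks unit k with
# probability w_1[k]
idx <- sample(length(d_1), size = 200, prob = w_1, replace = TRUE)
d_2 <- d_1[idx]
w_2 <- w_1[idx]

f_pop   <- ecdf(d_1)                     # population distribution function
f_naive <- ecdf(d_2)                     # ignores the unequal sampling
f_wtd   <- weighted_ecdf(d_2, 1 / w_2)   # inverse-probability weights

# Compare on every distinct population value, not only the sampled ones
grid <- sort(unique(d_1))
ks_naive <- max(abs(f_pop(grid) - f_naive(grid)))
ks_wtd   <- max(abs(f_pop(grid) - f_wtd(grid)))
c(naive = ks_naive, weighted = ks_wtd)
```

Both distances vary from run to run. The weighted one estimates the population distribution function from a sample, so it should be small but is not expected to be exactly zero.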