I have a matrix in which each row is a sample from a distribution. I want to do a rolling comparison of the distributions using ks.test
and save the test statistic in each case. The simplest way to implement this conceptually is with a loop:
set.seed(1942)
mt <- rbind(rnorm(5), rnorm(5), rnorm(5), rnorm(5))
results <- matrix(as.numeric(rep(NA, nrow(mt))))
for (i in 2 : nrow(mt)) {
results[i] <- ks.test(x = mt[i - 1, ], y = mt[i, ])$statistic
}
However, my real data has ~400 columns and ~300,000 rows for a single example, and I have a lot of examples. So I'd like this to be fast. The Kolmogorov-Smirnov test isn't all that mathematically complicated, so if the answer is "implement it in Rcpp
," I'll grudgingly accept that, but I'd be somewhat surprised -- it's already very fast to compute on a single pair in R.
Methods I've tried but have been unable to get working: dplyr
using rowwise/do/lag
, zoo
using rollapply
(which is what I use to generate the distributions), and populating a data.table
in a loop (edit: this one works, but it's still slow).
A quick and dirty implementation in Rcpp
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
double KS(arma::colvec x, arma::colvec y) {
int n = x.n_rows;
arma::colvec w = join_cols(x, y);
arma::uvec z = arma::sort_index(w);
w.fill(-1); w.elem( find(z <= n-1) ).ones();
return max(abs(cumsum(w)))/n;
}
// [[Rcpp::export]]
Rcpp::NumericVector K_S(arma::mat mt) {
int n = mt.n_cols;
Rcpp::NumericVector results(n);
for (int i=1; i<n;i++) {
arma::colvec x=mt.col(i-1);
arma::colvec y=mt.col(i);
results[i] = KS(x, y);
}
return results;
}
for matrix of size (400, 30000)
, it completes under 1s.
system.time(K_S(t(mt)))[3]
#elapsed
# 0.98
And the result seems to be accurate.
set.seed(1942)
mt <- matrix(rnorm(400*30000), nrow=30000)
results <- rep(0, nrow(mt))
for (i in 2 : nrow(mt)) {
results[i] <- ks.test(x = mt[i - 1, ], y = mt[i, ])$statistic
}
result <- K_S(t(mt))
all.equal(result, results)
#[1] TRUE