Search code examples
roptimizationrollapply

Efficiently perform row-wise distribution test


I have a matrix in which each row is a sample from a distribution. I want to do a rolling comparison of the distributions using ks.test and save the test statistic in each case. The simplest way to implement this conceptually is with a loop:

set.seed(1942)
mt <- rbind(rnorm(5), rnorm(5), rnorm(5), rnorm(5))

results <- matrix(as.numeric(rep(NA, nrow(mt))))

for (i in 2 : nrow(mt)) {

  results[i] <- ks.test(x = mt[i - 1, ], y = mt[i, ])$statistic

}

However, my real data has ~400 columns and ~300,000 rows for a single example, and I have a lot of examples. So I'd like this to be fast. The Kolmogorov-Smirnov test isn't all that mathematically complicated, so if the answer is "implement it in Rcpp," I'll grudgingly accept that, but I'd be somewhat surprised -- it's already very fast to compute on a single pair in R.

Methods I've tried but have been unable to get working: dplyr using rowwise/do/lag, zoo using rollapply (which is what I use to generate the distributions), and populating a data.table in a loop (edit: this one works, but it's still slow).


Solution

  • A quick and dirty implementation in Rcpp

    // [[Rcpp::depends(RcppArmadillo)]]
    #include <RcppArmadillo.h> 
    
    double KS(arma::colvec x, arma::colvec y) {
      int n = x.n_rows;
      arma::colvec w = join_cols(x, y);
      arma::uvec z = arma::sort_index(w);
      w.fill(-1); w.elem( find(z <= n-1) ).ones();
      return max(abs(cumsum(w)))/n;
    }
    // [[Rcpp::export]]
    Rcpp::NumericVector K_S(arma::mat mt) {
      int n = mt.n_cols; 
      Rcpp::NumericVector results(n);
      for (int i=1; i<n;i++) {
        arma::colvec x=mt.col(i-1);
        arma::colvec y=mt.col(i);
        results[i] = KS(x, y);
        }
      return results;
    }
    

    for matrix of size (400, 30000), it completes under 1s.

    system.time(K_S(t(mt)))[3]
    #elapsed 
    #   0.98 
    

    And the result seems to be accurate.

    set.seed(1942)
    mt <- matrix(rnorm(400*30000), nrow=30000)
    results <- rep(0, nrow(mt))
    for (i in 2 : nrow(mt)) {
      results[i] <- ks.test(x = mt[i - 1, ], y = mt[i, ])$statistic
    }
    result <- K_S(t(mt))
    all.equal(result, results)
    #[1] TRUE