Tags: r, grouping, aggregate, summary

How to increase the speed of aggregating and summarizing multiple variables in R?


I am doing a resampling (i.e., bootstrap) procedure in which one of the repeated steps is calculating the mean of multiple numeric variables for each of multiple groups. I have found fairly straightforward solutions using dplyr, doBy, and data.table, which I provide below.

However, each of them typically takes over a millisecond to complete (as per microbenchmark). Since this process will be repeated several thousand times (along with other operations), I would like to optimize it as much as possible. Ideally, it would complete in microseconds or faster.

Can anyone think of a way to increase the speed of these operations? One idea I had was to convert the numeric variables to a matrix and use colMeans(), but I wasn't sure how to do the subsetting quickly.

dat <- data.frame(
  a = runif(1000),
  b = runif(1000),
  c = runif(1000),
  group = factor(rep(c(1, 2), 500))
)

library(dplyr)
dat %>% group_by(group) %>% summarise_all(mean)
#microbenchmark = 7.1 milliseconds

library(doBy)
summaryBy(. ~ group, dat, FUN = mean)
#microbenchmark = 4.6 milliseconds

library(data.table)
setDT(dat)[, lapply(.SD, mean), by = 'group']
#microbenchmark = 1.8 milliseconds

#base
mat <- as.matrix(dat[, 1:(ncol(dat) - 1)])
grp <- dat$group
by(mat, grp, colMeans)
#microbenchmark = 1.2 milliseconds
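
One way to act on the matrix idea without subsetting per group is base R's rowsum(), which sums the rows of a matrix by group in compiled code; dividing by the group sizes then gives the group means. The following is a minimal sketch, assuming the same layout as dat above and assuming every factor level occurs at least once. The names mat, grp, and group_means are illustrative, and no timing is claimed for it here.

```r
# Simulated data with the same layout as in the question
set.seed(1)
dat <- data.frame(
  a = runif(1000),
  b = runif(1000),
  c = runif(1000),
  group = factor(rep(c(1, 2), 500))
)

# Do the conversion once, outside any resampling loop
mat <- as.matrix(dat[, c("a", "b", "c")])
grp <- dat$group

# rowsum() returns one row of column sums per group, in factor-level order;
# tabulate() counts the rows per factor level in that same order.
# Assumption: no empty levels, so both have one entry per level.
group_means <- rowsum(mat, grp) / tabulate(grp, nbins = nlevels(grp))
group_means
```

The result is a g-by-k matrix with one row per group, so each row can be passed directly to a function that expects a mean vector.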

Update:

To provide more information about my broader task, I am creating a function that will take in data on k variables from n subjects in g mutually-exclusive groups in the form of a data frame (n-by-k). The main purpose of the function is to first aggregate the data by taking the mean of each variable within each group (g-by-k) and second to apply a statistical function to the mean vector for each group separately (1-by-k). This statistical function returns p estimates of parameters of interest.

Furthermore, bootstrapped confidence intervals for these estimates need to be calculated, so the function estimates these parameters for each of r resamples with replacement from the original data frame (stratified by group). Ultimately, I need to know the parameter estimates for each group from each resample (p-by-g-by-r) so that I can use percentiles or some other approach to estimate the confidence interval for each parameter in each group.

Note that I have already successfully optimized the statistical function, which now takes around 50 microseconds to complete on the most common vector size. Thus, the remaining bottleneck seems to be the creation of these vectors for each resample (i.e., the aggregating and summarizing).
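
The resampling scheme described in this update could be sketched roughly as follows. Because the resampling is stratified by group, row indices can be drawn within each group, which keeps the group sizes fixed and produces the p-by-g-by-r array directly. Everything named here is an assumption for illustration: stat_fun is a hypothetical stand-in for the already-optimized statistical function (it returns p = 2 values), and boot_group_stats, r = 200, and the percentile step are not from the original post.

```r
# Hypothetical stand-in for the statistical function:
# takes a mean vector of length k and returns p = 2 estimates
stat_fun <- function(m) c(total = sum(m), spread = max(m) - min(m))

boot_group_stats <- function(mat, grp, r, stat_fun) {
  # Row indices of each group, computed once
  idx <- split(seq_len(nrow(mat)), grp)
  g <- length(idx)
  p <- length(stat_fun(colMeans(mat)))
  out <- array(NA_real_, dim = c(p, g, r),
               dimnames = list(NULL, names(idx), NULL))
  for (b in seq_len(r)) {
    for (j in seq_len(g)) {
      # Resample rows with replacement within group j
      rows <- idx[[j]][sample.int(length(idx[[j]]), replace = TRUE)]
      out[, j, b] <- stat_fun(colMeans(mat[rows, , drop = FALSE]))
    }
  }
  out
}

set.seed(1)
mat <- matrix(runif(3000), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
grp <- factor(rep(c(1, 2), 500))
est <- boot_group_stats(mat, grp, r = 200, stat_fun = stat_fun)

# Percentile intervals: one 2.5% / 97.5% pair per parameter and group
ci <- apply(est, c(1, 2), quantile, probs = c(0.025, 0.975))
dim(est)  # dimensions are p, g, r
dim(ci)   # dimensions are 2 (lower/upper), p, g
```

This sketch is written for clarity, not speed; the inner call to colMeans() on a row subset is the step the question is asking how to make faster.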


Solution

  • I was able to get to the order of microseconds using Rcpp and RcppArmadillo.

    dat <- data.frame(
      a = runif(1000),
      b = runif(1000),
      c = runif(1000),
      group = factor(rep(c(1, 2), 500))
    )
    mat <- as.matrix(dat[, 1:(ncol(dat) - 1)])
    grp <- as.integer(dat$group)
    group_scores(mat, grp)
    #microbenchmark: 48 microseconds
    

    Below is the Rcpp code for the group_scores function. It has to be compiled (for example with Rcpp::sourceCpp()) before the R code above will run:

    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]

    using namespace Rcpp;

    // Rows of X whose group label in T equals TestVal
    // [[Rcpp::export]]
    arma::mat submat(NumericMatrix X, NumericVector T, int TestVal) {
      // Wrap the R memory without copying (copy_aux_mem = false)
      arma::mat Xmat(X.begin(), X.nrow(), X.ncol(), false);
      arma::colvec tIdx(T.begin(), T.size(), false);
      arma::mat y = Xmat.rows(arma::find(tIdx == TestVal));
      return y;
    }

    // Column means of a matrix, returned as a row vector
    // [[Rcpp::export]]
    arma::rowvec col_means(const arma::mat& x) {
      return arma::mean(x, 0);
    }

    // One row of column means per group
    // [[Rcpp::export]]
    arma::mat group_scores(NumericMatrix X, NumericVector T) {
      // unique() gives no guaranteed order, so sort the labels:
      // row i of the result then belongs to the i-th smallest label
      NumericVector levels = sort_unique(T);
      int n = levels.size();
      int m = X.ncol();
      arma::mat out(n, m);
      for (int i = 0; i < n; i++) {
        int level = levels(i);
        arma::mat sub = submat(X, T, level);
        out.row(i) = col_means(sub);
      }
      return out;
    }