Fast way to permute rows of matrices in Rcpp or RcppArmadillo?

I'm running a stationary bootstrap algorithm on an N x M matrix, X, where both N and M are on the order of 1500 to 3000.

The bootstrap matrix of index permutations, Y, is N x B, where B is, say, 10,000.

In R syntax, the goal is to compute:

sapply(1:B, function(b) colSums(X[Y[,b],]))

That is, we rearrange the rows of X (with possible duplications) and take the column sums--10,000 times.

The above code takes about 3 minutes with N = 1500, M = 2000, B = 10000.

Converting the code to Rcpp reduces it to about 25 seconds:

// [[Rcpp::export]]
NumericVector get_bootstrap_statistic_mean(NumericMatrix x, IntegerMatrix theta){
    
    int nr = x.nrow();
    int nc = x.ncol(); 
    int nb = theta.ncol();
    NumericMatrix res(nb, nc);
    
        
    for(int j = 0; j < nc; j++){
        
        NumericMatrix::Column x_col_j = x.column(j);
        NumericMatrix::Column res_col_j = res.column(j);
        
        for(int b = 0; b < nb; b++){
    
            IntegerMatrix::Column theta_col_b = theta.column(b);
            
            double sum = 0.0;
            for(int i = 0; i < nr; i++){
            
                sum += x_col_j[theta_col_b[i]-1]; //Have to subtract one to map the R indices (start at 1) to the C++ indices (start at 0)
            
            }
            res_col_j[b] = sum / nr;
        }
    }
    return res;
}

But this post shows a faster way to get column sums than the nested loops above.

Thus, is there a way to create the permuted matrices in C++ (Rcpp, RcppArmadillo) that is faster than doing :

sapply(1:B, function(b) Arma_colSums(X[Y[,b],])

which takes about 20 seconds for N = 1500, M = 2000, B = 10000 (where Arma_colSums was (more or less) the fastest method from the linked post), so that we can apply Arma_colSums to the permuted matrices?

I've looked at the RcppArmadillo subsetting documents and most of it seems like it's fetching row or column vectors rather than rearranging the entire matrix. And the "sub2ind" functionality seems not applicable or if it is applicable, that it would take longer to put the permutation matrix in the required form than using one of the faster approaches above.

I've also looked at the bootstrap example in the RcppArmadillo introduction, but it uses IID bootstrap on a single column of X (whereas X here has thousands of columns).

I tried chatGPT, but it wasn't able to provide any compilable code.

Solution

Use matrix multiplication instead:

library(Rfast) # for `colTabulate` and `Crossprod`

system.time(Z1 <- get_bootstrap_statistic_mean(X, Y))
#>    user  system elapsed 
#>   28.17    0.01   28.19
system.time(Z2 <- Crossprod(X, colTabulate(Y, N)))
#>    user  system elapsed 
#>   26.30    1.36    3.81

all.equal(Z1, t(Z2)/N)
#> [1] TRUE

Data:

N <- 15e2L
M <- 2e3L
B <- 1e4L

X <- matrix(runif(N*M), N, M)
Y <- matrix(sample(N, N*B, 1), N, B)