Speed up Rcpp evaluations within R loop

It is well known that implementations in Rcpp will, in general, be much faster than implementations in R. I am interested in whether there are good practices to speed up single evaluations of Rcpp functions that have to be evaluated within a R loop.

Consider the following example where I use a simple multivariate normal generating function in Rcpp:

#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

using namespace arma; 
using namespace Rcpp;

// [[Rcpp::export]]
mat mvrnormArma(int n, mat sigma) {
   int ncols = sigma.n_cols;
   mat Y = randn(n, ncols);
   return Y * chol(sigma);
}

Assume the goal is to generate 10,000 10-dimensional multivariate normal variables using the following two functions:

PureRcpp = function(n){mvrnormArma(n, diag(10))}
LoopRcpp = function(n){for(ii in 1:n){mvrnormArma(1, diag(10))}}

Here, PureRcpp is of course the preferable and much faster solution. However, in certain applications it may be necessary to rely on single evaluations of mvrnormArma in an R loop. This is the approach taken in LoopRcpp which is certainly the slower solution. I was, however, a bit surprised when i benchmarked these and saw HOW slow the second solution really was:

> microbenchmark::microbenchmark(PureRcpp(10000), LoopRcpp(10000))
Unit: milliseconds
            expr       min        lq      mean    median        uq      max neval cld
 PureRcpp(10000)  2.236624  2.365988  2.578869  2.435268  2.565488 10.79609   100  a 
 LoopRcpp(10000) 52.590143 53.315655 58.080897 55.406020 62.264711 80.96275   100   b

Is this massive slow-down just something that we have to live with when we have to work in R loops or are there some possibilities to reduce the overhead that results due to looping? I'm aware that we could just rewrite everything in C++, but the goal is to offer a speedy ''Rcpp within R loop'' solution if possible.

Solution

As Roland pointed out that is mostly due to function calls. However, you can shave off some time (and get a more accurate comparison) by optimising/adapting your code.

Pass to the Cpp function by reference
Don't create the diagonal in the loop
Use a vector in the single dispatch
Draw vectorised random numbers

// [[Rcpp::export]]
mat draw_randn(int n, int ncols) {
  mat Y = randn(n, ncols);
  return(Y);
}
// [[Rcpp::export]]
mat mvrnormArma(mat sigma, mat Y) {
  return Y * chol(sigma);
}
// [[Rcpp::export]]
mat mvrnormArma_loop(mat& sigma, rowvec& Y) {
  return Y * chol(sigma);
}

And benchmark that.

PureRcpp = function(n) {
  Y <- draw_randn(n, 10)
  I <- diag(10)
  mvrnormArma(I, Y)
}
LoopRcpp = function(n) {
  Y <- draw_randn(n, 10)
  I <- diag(10)
  for(ii in 1:n) {mvrnormArma_loop(I, Y[ii, ])}
}

Shaves off about 10ms for me.