Search code examples
rrcpp

Why is the Rcpp implementation in my example much slower than the R function?


I have some experience in C++ and R but am a newbie to Rcpp. Recently I had a huge success by using Rcpp in some of my previous projects, so decided to apply it to a new project. I was surprised that my Rcpp code could be much slower than the corresponding R function. I have tried to simplify my R function to figure out the reason but cannot find any clue. Your help and comments are very welcome!

The main R function to compare R and Rcpp implementations:

main <- function(){

  n <- 50000
  Delta <- exp(rnorm(n))
  delta <- exp(matrix(rnorm(n * 5), nrow = n))
  rx <- matrix(rnorm(n * 20), nrow = n)
  print(microbenchmark(c1 <- test(Delta, delta, rx), times = 500))
  print(microbenchmark(c2 <- rcpp_test(Delta, delta, rx), times = 500))

  identical(c1, c2)
  list(c1 = c1, c2 = c2)
}

R implementation:

test <- function(Delta, delta, rx){

  const <- list()
  for(i in 1:ncol(delta)){
    const[[i]] <- rx * (Delta / (1 + delta[, i]))
  }

  const

}

Rcpp implementation:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List rcpp_test(NumericVector Delta, 
               NumericMatrix delta, 
               NumericMatrix rx) {

  int n = Delta.length();
  int m = rx.ncol();

  List c; 
  NumericMatrix c1;
  for(int i = 0; i < delta.ncol(); ++i){
    c1 = NumericMatrix(n, m);
    for(int k = 0; k < n; ++k){
      double tmp = Delta[k] / (1 + delta(k, i));
      for(int j = 0; j < c1.ncol(); ++j){
        c1(k, j) = rx(k, j) * tmp; 
      }
    }
    c.push_back(c1);
  }

  return c;

}

I understand that there is no guarantee of improved efficiency by using Rcpp, but given the simple example I show here, I do not see why the Rcpp code runs so slowly.

Unit: milliseconds
                         expr      min       lq     mean   median       uq      max neval
 c1 <- test(Delta, delta, rx) 13.16935 14.19951 44.08641 30.43126 73.78581 115.9645   500
Unit: milliseconds
                              expr      min       lq     mean  median       uq      max neval
 c2 <- rcpp_test(Delta, delta, rx) 143.1917 158.7481 171.6116 163.413 173.7677 247.5495   500

Ideally rx is a list of matrices in my project. The variable i in the for loop will be used to pick an element for computing. At the beginning I suspected that passing a List to Rcpp could have a high overhead, so in this example, I assumed rx to be a fixed matrix being used for all i. Seems like that is not the reason for the slowness.


Solution

  • Your R code seems to be more or less optimal, i.e. all the real work is done in compiled code. For the C++ code the main problem I could find is calling c1.ncol() in a tight loop. If I replace that with m, the C++ solution is almost as fast as R. If I add RcppArmadillo to the mix, I get a very compact syntax, but not faster than the pure Rcpp code. To me this shows that it can be really hard to beat well written R code:

    //  [[Rcpp::depends(RcppArmadillo)]]
    #include <RcppArmadillo.h>
    using namespace Rcpp;
    
    // [[Rcpp::export]]
    List arma_test(const arma::vec& Delta,
               const arma::mat& delta,
               const arma::mat& rx) {
      int l = delta.n_cols;
      List c(l);
    
      for (int i = 0; i < l; ++i) {
        c(i) = rx.each_col() % (Delta / (1 + delta.col(i)));
      }
    
      return c;  
    }
    
    // [[Rcpp::export]]
    List rcpp_test(NumericVector Delta, 
                   NumericMatrix delta, 
                   NumericMatrix rx) {
    
      int n = Delta.length();
      int m = rx.ncol();
    
      List c(delta.ncol()); 
      NumericMatrix c1;
      for(int i = 0; i < delta.ncol(); ++i){
        c1 = NumericMatrix(n, m);
        for(int k = 0; k < n; ++k){
          double tmp = Delta[k] / (1 + delta(k, i));
          for(int j = 0; j < m; ++j){
            c1(k, j) = rx(k, j) * tmp; 
          }
        }
        c(i) = c1;
      }
    
      return c;
    
    }
    
    /*** R
    test <- function(Delta, delta, rx){
    
      const <- list()
      for(i in 1:ncol(delta)){
        const[[i]] <- rx * (Delta / (1 + delta[, i]))
      }
    
      const
    
    }
    
    n <- 50000
    Delta <- exp(rnorm(n))
    delta <- exp(matrix(rnorm(n * 5), nrow = n))
    rx <- matrix(rnorm(n * 20), nrow = n)
    bench::mark(test(Delta, delta, rx),
                arma_test(Delta, delta, rx),
                rcpp_test(Delta, delta, rx))
     */
    

    Output:

    # A tibble: 3 x 14
      expression     min    mean  median     max `itr/sec` mem_alloc  n_gc n_itr
      <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>
    1 test(Delt…  84.3ms  85.2ms  84.9ms  86.6ms     11.7     44.9MB     2     4
    2 arma_test… 106.5ms 107.7ms 107.7ms 108.9ms      9.28    38.1MB     3     2
    3 rcpp_test… 101.9ms 103.2ms 102.2ms 106.6ms      9.69    38.1MB     1     4
    # … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
    #   time <list>, gc <list>
    

    I have also explicitly initialized the output list to the required size avoiding the push_back, but that did not make a big difference. With vector like data structures from Rcpp you should definitely avoid using push_back though, since there a copy will be made every time you extend the vector.