Search code examples
rrcpp

It seems it is a bit slow to extract element from a List in Rcpp


I just wrote a Rcpp function with three same size input vectors, x(numeric) y(numeric) and category(character). Then I want to return a list, the list size is equal to the length of unique category values. Each element in this list is a same size matrix (equal rows and columns) based on x and y with corresponding category.

However, I found my code is not fast enough when the size of n is huge. I think the reason is I need to extract something from the list, do some computation and insert it back every time. Does anyone have suggestions on how to speed up the process.

Rcpp code

#include <Rcpp.h>
using namespace Rcpp;

//[[Rcpp::export]]
List myList(NumericVector x, NumericVector y, CharacterVector category) {

  int n = x.size();
  CharacterVector levels = unique(category);
  int levels_size = levels.size();
  List L(levels_size);

  int plot_width = 600;
  int plot_height = 600;

  // Each element in the list L has the same size Matrix
  for(int j = 0; j < levels_size; j++) {
    NumericMatrix R(plot_height, plot_width);
    L[j] = R;
  }
  int id = 0;

  double xmax = max(x);
  double ymax = max(y);
  double xmin = min(x);
  double ymin = min(y);

  for(int i=0; i < n; i++) {

    for(int j = 0; j < levels_size; j++) {
      if(category[i] == levels[j]) {
        id = j;
        break;
      }
    }

    int id_x = floor((x[i] - xmin)/(xmax - xmin) * (plot_width - 1));
    int id_y = floor((y[i] - ymin)/(ymax - ymin) * (plot_height - 1));

    NumericMatrix M = L[id];
    // some computation in M
    M(id_y, id_x) += 1;
    L[id] = M;
  }
  return(L);
}

R code

n <- 1e8
class <- 20

x <- rnorm(n)
y <- rnorm(n)
category <- sample(as.factor(1:class), size = n, replace = TRUE)

start_time <- Sys.time()
L <- myList(x = x, y = y, category = category)
end_time <- Sys.time()
end_time - start_time
# Time difference of 35.3367 secs

Solution

  • I suspect two main problems concerning performance:

    • Lots of string comparisons (of the order of 1e9)
    • Lots of cache misses for the matrices, since in general two consecutive xy-pairs won't be from the same category and will therefore need a different matrix

    Both indicate into the same direction: Do not try to implement your own GROUP BY operations. Database engines and packages like data.table know better how to do that. For example, when using data.table we need a much simpler function that expects the x and y for one category and outputs a single matrix:

    #include <Rcpp.h>
    using namespace Rcpp;
    
    //[[Rcpp::export]]
    NumericMatrix getMat(NumericVector x, NumericVector y,
                         double xmin, double xmax, double ymin, double ymax,
                         int plot_width = 600, int plot_height = 600) {
        int n = x.size();
        NumericMatrix M(plot_height, plot_width);
    
        for(int i=0; i < n; i++) {
            int id_x = floor((x[i] - xmin)/(xmax - xmin) * (plot_width - 1));
            int id_y = floor((y[i] - ymin)/(ymax - ymin) * (plot_height - 1));
            M(id_y, id_x) += 1;
        }
        return M;
    }
    
    /***R
    n <- 1e8
    class <- 20
    
    library("data.table")
    foo <- data.table(x = rnorm(n),
                      y = rnorm(n),
                      category = sample(as.factor(1:class), size = n, replace = TRUE))
    
    xmin <- min(foo$x)
    xmax <- max(foo$x)
    ymin <- min(foo$y)
    ymax <- max(foo$y)
    
    system.time(bar <- foo[,
                           list(baz = list(getMat(x, y, xmin, xmax, ymin, ymax))),
                           by = category])
    */
    

    Notes:

    • On my system the aggregation takes less than 6 seconds.
    • It is even faster if one does a setkey(foo, category) before the aggregation. That physically alters the order of the rows, though. Use with care!
    • data.table syntax is a bit terse, but one gets used to it ...
    • The structure of the output is different, but can be converted if needed.