Extracting a submatrix with Rcpp

In a package of mine, I extract a submatrix B from a matrix A as follows:

NumericMatrix extractColumns(NumericMatrix A, IntegerVector indices) {
  int n = indices.size();
  int m = A.nrow();
  NumericMatrix B(m, n);
  for(int i = 0; i < n; i++) {
    B(_, i) = A(_, indices(i));
  }
  return B;
}

I observed it is slow. Then I've just benchmarked and this is indeed slower than the ordinary R way:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix extractColumns(NumericMatrix A, IntegerVector indices) {
  int n = indices.size();
  int m = A.nrow();
  NumericMatrix B(m, n);
  for(int i = 0; i < n; i++) {
    B(_, i) = A(_, indices(i));
  }
  return B;
}

//

/*** R
library(microbenchmark)
A <- matrix(rgamma(100000L, 5, 1), nrow = 2L, ncol = 50000L)
indices <- 2000L:40000L 
microbenchmark(
  R    = A[, indices],
  Rcpp = extractColumns(A, indices - 1L),
  times = 5L
)
*/

Why is it slow and is there a faster way with Rcpp? (the indices are not contiguous in general).

Edit

Also tested with RcppEigen. Slow as well.

Eigen::MatrixXd extractColumns2(Eigen::MatrixXd A, Rcpp::IntegerVector indices) {
  int m = A.rows();
  int n = indices.size();
  Eigen::MatrixXd B(m, n);
  for(int i = 0; i < n; i++) {
    B.col(i) = A.col(indices(i));
  }
  return B;
}

Solution

This is an interesting question, and worth comparing a few approaches. I also think it is worth reiterating that we almost always have a tradeoff between clarity and performance. As such, I still like RcppArmadillo (even though the second measurement set below looks a little worrisome).

First off, the code. I added a function for RcppArmadillo which, given a vector of unsigned indices, can index directly which makes it a one-liner. Which clearly wins on clarity.

Code

#include <RcppArmadillo.h>
#include <RcppEigen.h>
using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::NumericMatrix extractColumns(Rcpp::NumericMatrix A, Rcpp::IntegerVector indices) {
  int n = indices.size();
  int m = A.nrow();
  NumericMatrix B(m, n);
  for(int i = 0; i < n; i++) {
    B(_, i) = A(_, indices(i));
  }
  return B;
}

// [[Rcpp::export]]
NumericMatrix extractColumns1(NumericMatrix A, IntegerVector indices) {
  int n = indices.size();
  int m = A.nrow();
  NumericMatrix B(m, n);
  for(int i = 0; i < m; i++) {
    for(int j = 0; j < n; j++) {
      B(i, j) = A(i, indices(j));
    }
  }
  return B;
}

// [[Rcpp::depends(RcppEigen)]]

// [[Rcpp::export]]
Eigen::MatrixXd extractColumns2(Eigen::MatrixXd A, Rcpp::IntegerVector indices) {
  int m = A.rows();
  int n = indices.size();
  Eigen::MatrixXd B(m, n);
  for(int i = 0; i < n; i++) {
    B.col(i) = A.col(indices(i));
  }
  return B;
}

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::mat extractColumns3(arma::mat A, arma::uvec v) {
    return A.cols(v);
}

Benchmark

I made two changes. First, I added collapse besides (Rcpp)Armadillo as it explicitly aims for performance (for the R user) which efficient and pure code. Second, because the original matrix was a bit degenerate with two rows times 50k colums, I create a somewhat more balanced 200 by 500 one.

/*** R
library(microbenchmark)
A <- matrix(rgamma(100000L, 5, 1), nrow = 2L, ncol = 50000L)
indices <- 2000L:40000L
microbenchmark(
  R         = A[, indices],
  RcppBasic = extractColumns(A, indices - 1L),
  RcppImprv = extractColumns1(A, indices - 1L),
  RcppEigen = extractColumns2(A, indices - 1L),
  RcppArma  = extractColumns3(A, indices - 1L),
  collapse  = collapse::ss(A, 1:2, indices),
  times = 25L
)

A <- matrix(rgamma(100000L, 5, 1), nrow = 200L, ncol = 500L)
indices <- 200L:400L
microbenchmark(
  R         = A[, indices],
  RcppBasic = extractColumns(A, indices - 1L),
  RcppImprv = extractColumns1(A, indices - 1L),
  RcppEigen = extractColumns2(A, indices - 1L),
  RcppArma  = extractColumns3(A, indices - 1L),
  collapse  = collapse::ss(A, 1:200, indices),
  times = 25L
)

all.equal(A[, indices], extractColumns(A, indices - 1L))
all.equal(A[, indices], extractColumns2(A, indices - 1L))
all.equal(A[, indices], extractColumns3(A, indices - 1L))
all.equal(A[, indices], collapse::ss(A, 1:200, indices))

*/

Results

> Rcpp::sourceCpp("~/git/stackoverflow/76538309/question.cpp")

> library(microbenchmark)

> A <- matrix(rgamma(100000L, 5, 1), nrow = 2L, ncol = 50000L)

> indices <- 2000L:40000L

> microbenchmark(
+ >   R         = A[, indices],
+ >   RcppBasic = extractColumns(A, indices - 1L),
+ >   RcppImprv = extractColumns1(A, indices - 1L),
+ >   .... [TRUNCATED] 
Unit: microseconds
      expr      min       lq     mean   median       uq      max neval cld
         R  198.698  201.356  601.502  247.764  406.431 4636.895    25 ab 
 RcppBasic 1260.979 1337.403 1742.815 1522.285 1789.843 4884.915    25   c
 RcppImprv  310.913  339.943  623.045  533.361  587.319 2480.557    25 ab 
 RcppEigen  293.941  502.759  922.591  772.874 1024.857 4161.447    25 a  
  RcppArma  299.430  330.409  811.221  786.544 1072.527 1997.107    25 ab 
  collapse  201.500  210.785  316.888  327.262  409.214  465.755    25  b 

> A <- matrix(rgamma(100000L, 5, 1), nrow = 200L, ncol = 500L)

> indices <- 200L:400L

> microbenchmark(
+ >   R         = A[, indices],
+ >   RcppBasic = extractColumns(A, indices - 1L),
+ >   RcppImprv = extractColumns1(A, indices - 1L),
+ >   .... [TRUNCATED] 
Unit: microseconds
      expr     min      lq     mean  median      uq      max neval cld
         R  63.524  64.036  80.8516  65.433  72.149  174.069    25   a
 RcppBasic  23.706  27.381  70.9232  31.737 135.762  144.999    25   a
 RcppImprv 132.711 134.079 151.1988 136.953 139.232  250.357    25   a
 RcppEigen  51.693  75.133 172.8276 121.080 163.501  434.985    25   a
  RcppArma  79.163  95.515 290.6080 187.556 515.465  618.522    25   a
  collapse  65.720  66.837 240.1691  68.680  95.058 3777.815    25   a

> all.equal(A[, indices], extractColumns(A, indices - 1L))
[1] TRUE

> all.equal(A[, indices], extractColumns1(A, indices - 1L))
[1] TRUE

> all.equal(A[, indices], extractColumns2(A, indices - 1L))
[1] TRUE

> all.equal(A[, indices], extractColumns3(A, indices - 1L))
[1] TRUE

> all.equal(A[, indices], collapse::ss(A, 1:200, indices))
[1] TRUE
>

Comment

Lots of things matter. The 'improved Rcpp' function by @Mikko looks good on the original "and oddly shaped matrix", it underperforms vis-a-vis the Rcpp Sugar enhanced one ... because Sugar can give you parallel loop unrolling which starts to matter once there is looping to be done (and hurts when not as seen here). The 'collapse' function looks and beats base R on the "odd" matrix and is at par on the normal one. Eigen and Armadillo do their thing, on the "odd" matrix they suffer (and using them forces casts that have a small cost). The bad performance of RcppArmadillo on the second example is a bit of head-scratcher.

So in sum: Details matter. The setup matters. Nothing really dominates all use cases.