Is there any way to efficiently translate the outer() function for multiplication of two vectors from R base to RcppArmadillo? I attempted to do so but it is not efficient at all.
Take the following example:
library(Rcpp)
library(RcppArmadillo)
library(microbenchmark)
#Outer attempt
cppFunction(depends = "RcppArmadillo",
'
arma::mat outer_rcpp(arma::vec x, arma::vec y) {
int x_length = x.n_elem;
int y_length = y.n_elem;
arma::mat final(x_length, y_length);
// And use loops instead of outer
for(int i = 0; i < x_length; i++) {
final.col(i) = x[i] * y;
}
return(final);
}
'
)
#Test for equal results
a <- rnorm(5)
base <- base::outer(a, a)
rcpp <- outer_rcpp(a, a)
all.equal(base, rcpp)
#Test for speed
b <- rnorm(5000)
microbenchmark(base = base::outer(b, b),
rcpp = outer_rcpp(b, b), times = 10)
The results are 2 times slower using R base. I am sure that this can be done though matrix multiplication, any idea how?
As @thelatemail pointed out in the comments, the outer
routine is already using a heavily optimized C routine.
src/library/base/R/outer.R
: tcrossprod
usage.src/main/array.c
: underlying C routine powering the tcrossprod
computation.Armadillo itself has its own optimization for addressing matrix multiplication using the dgemm
and dgemv
routines from LAPACK:
armadillo_bits/mul_gemm.hpp
: C := alpha*op( A )op( B ) + betaC,
armadillo_bits/mul_gemv.hpp
: y := alphaAx + betay, or y := alphaA**Tx + betay,
Playing around with the outerproduct calculations leads to a few optimizations. Mainly, we're opting to move the outer product into armadillo actions instead of loops:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat outer_rcpp(const arma::vec& x, const arma::vec& y) {
int x_length = x.n_elem;
int y_length = y.n_elem;
arma::mat final(x_length, y_length);
// And use loops instead of outer
for(int i = 0; i < x_length; i++) {
final.col(i) = x[i] * y;
}
return final;
}
// [[Rcpp::export]]
arma::mat outer_with_armadillo(const arma::vec& x, const arma::vec& y) {
arma::mat final = x*y.t();
return final;
}
// [[Rcpp::export]]
arma::mat outer_with_armadillo_transposed(const arma::vec& x, const arma::rowvec& y) {
arma::mat final = x*y;
return final;
}
Revisiting the benchmarking code, we have:
b = rnorm(5000)
b_tranposed = t(b)
bench_results = microbenchmark::microbenchmark(base = base::outer(b, b),
outer_armadillo_loop = outer_rcpp(b, b),
outer_armadillo_optimized = outer_with_armadillo(b, b),
outer_armadillo_optimized_transposed = outer_with_armadillo_transposed(b, b_tranposed), times = 10)
bench_results
expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|
base | 132.8601 | 141.3532 | 156.9979 | 146.7993 | 154.8954 | 234.2619 | 10 |
outer_armadillo_loop | 278.4115 | 279.9204 | 317.7907 | 288.4212 | 329.0769 | 451.6872 | 10 |
outer_armadillo_optimized | 272.4348 | 283.3380 | 347.7913 | 304.1181 | 339.3282 | 728.2264 | 10 |
outer_armadillo_optimized_transposed | 269.7855 | 270.7108 | 297.9580 | 279.8099 | 312.3488 | 386.4270 | 10 |
From the results, the lowest I could achieve is having a pre-transposed b
vector from column vector form into row-vector form: (n x 1) * (1 x m)