Is there a faster way to do this Cholesky factorization in Rcpp/c++?

This is a problem arose in my Markov Chain Monte Carlo (MCMC) algorithm. And I feel this problem is quite commonly encountered especially in hierarchical Gaussian models. Hence it would be great if there is a much more efficient solution. So the problem is like this:

I have many positive integer vectors xi, for i from 1 to n, a p.s.d. matrix A and a p.s.d. matrix B. For every xi I want to compute the following Cholesky factorization:

chol( kron( diagmat( xi ), A ) + B )

So kron( diagmat( xi ), A ) + B is the covariance matrix for a multivariate Gaussian and I want to sample from this Gaussian hence need the Cholesky factorization of it. The dimension for A and B are not small and I have a large n hence computing the above Cholesky factorization for all xi is really time-consuming. Below is the Rcpp function I wrote using RcppArmadillo:

#include <cmath>
#include <Rmath.h>
#include "RcppArmadillo.h"
// [[Rcpp::depends(RcppArmadillo)]]
using namespace arma;
using namespace Rcpp;

// [[Rcpp::export]]
mat test_C(mat A, mat B, mat X){
  int n_x = X.n_cols;
  int d_B = B.n_rows;
  mat sample(d_B, n_x);
  mat mS_chol_inv;
  for (int i = 0; i < n_x; i++){
    mS_chol_inv = inv(trimatu( chol(kron(diagmat( X.col(i) ),A) + B) ));
    sample.col(i) = mS_chol_inv*randn(d_B);
  }
  return(sample);
}

I also test the computational efficiency using the following code comparing it to its R counterpart:

test_R <- function(A,B,X){
  n_x <- ncol(X)
  d_B <- ncol(B)
  res <- sapply(1:n_x, function(x){
    mS_chol <- chol( kronecker( diag(X[,x]),A ) + B )
    return( mS_chol%*%as.matrix( rnorm(d_B) ) )
  })
  return(res)
}

# Simulate Data
R1 <- matrix(rnorm(24*2),24,2)
A <- R1%*%t(R1) + 0.1*diag(24)
R2 <- matrix(rnorm(264*2),264,2)
B <- R2%*%t(R2) + 0.1*diag(264)
X <- matrix(rpois(11*2178, 5),11,2178)

res <- benchmark(res_R <- test_R(A, B, X),
             res_C <- test_C(A, B, X),
             columns=c("test", "replications", "elapsed", "relative"),
             order="relative",
             replications = 2)

And the result is as follows

> print(res) 
                      test replications elapsed relative
1 res_R <- test_R(A, B, X)            2  18.920    1.000
2 res_C <- test_C(A, B, X)            2  20.724    1.095

As can be seen, a single run is approximately 10 second, and this is simply not feasible in a MCMC algorithm. Also, since the chol() dominates the computational complexity, the improvement of using Rcpp over pure R is trivial. But maybe I did not write the most efficient code? So any advice?

Since the matrix inside chol() is very structured and the only thing that is varying is xi, maybe there is some algebra trick that I do not know that can solve this efficiently? I have posted this as a linear-algebra question under Mathematics and here is the link. Unfortunately so far I have not received any solution, people do point out that this is embarrassingly parallel.

Any advice on code/algebra will be helpful! Thanks ahead for your time.

Solution

You may try to use Microsoft R from here, artist formerly known as RRO. It integrates with multi-threaded Intel MKL library (same place to find and install it) and on Intel hardware matrix operations are quite fast.

Disclaimer: YMMV