I am using R to create a simulation that takes samples from a random normal distribution, and not surprisingly, it's fairly slow. So, I looked for some ways to speed it up using Rcpp, and came across the RcppZiggurat package for faster random normal samples, and the RcppParallel package for multi-threaded computation, and thought, why not use both a faster algorithm and draw the samples in parallel?
So I started prototyping, and finally ended up with three methods to compare:
rnorm
Below are my implementations using RcppParallel + RcppZiggurat (the parallelDraws
function), and just RcppZiggurat (the serialDraws
function):
#include <Rcpp.h>
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
// [[Rcpp::depends(RcppZiggurat)]]
#include <Ziggurat.h>
static Ziggurat::Ziggurat::Ziggurat zigg;
using namespace RcppParallel;
struct Norm : public Worker
{
int input;
// saved draws
RVector<double> draws;
// constructors
Norm(const int input, Rcpp::NumericVector draws)
: input(input), draws(draws) {}
void operator()(std::size_t begin, std::size_t end) {
for (std::size_t i = begin; i < end; i++) {
draws[i] = zigg.norm();
}
}
};
// [[Rcpp::export]]
Rcpp::NumericVector parallelDraws(int x) {
// allocate the output vector
Rcpp::NumericVector draws(x);
// declare the Norm instance
Norm norm(x, draws);
// call parallelFor to start the work
parallelFor(0, x, norm);
// return the draws
return draws;
};
// [[Rcpp::export]]
Rcpp::NumericVector serialDraws(int x) {
// allocate the output vector
Rcpp::NumericVector draws(x);
for (int i = 0; i < x; i++) {
draws[i] = zigg.norm();
}
// return the draws
return draws;
};
When I benchmarked them, I found some surprising results:
library(microbenchmark)
microbenchmark(parallelDraws(1e5), serialDraws(1e5), rnorm(1e5))
Unit: microseconds
expr min lq mean median uq max neval
parallelDraws(1e+05) 3113.752 3539.686 3687.794 3599.1540 3943.282 5058.376 100
serialDraws(1e+05) 695.501 734.593 2536.940 757.2325 806.135 175712.496 100
rnorm(1e+05) 6072.043 6264.030 6655.835 6424.0195 6661.739 18578.669 100
Using RcppZiggurat alone was about 8x faster than rnorm
, but using RcppParallel and RcppZiggurat together was slower than using RcppZiggurat alone! I tried playing around with the grain size for the RcppParallel ParallelFor
function, but it didn't result in any noticeable improvement.
My question is: What could be the reason why adding parallelism is actually worse here? I know that "overhead" in parallel computations can outweigh the benefits depending on various factors. Is that what is happening here? Or am I completely misunderstanding how to effectively use the RcppParallel package?
As mentioned in the comments, overhead can be problematic especially when the overall runtime is short, it is better to not initialize the output vectors with zero and to use thread local RNGs. Example implementation:
#include <Rcpp.h>
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
// [[Rcpp::depends(RcppZiggurat)]]
#include <Ziggurat.h>
using namespace RcppParallel;
struct Norm : public Worker
{
// saved draws
RVector<double> draws;
// constructors
Norm(Rcpp::NumericVector draws)
: draws(draws) {}
void operator()(std::size_t begin, std::size_t end) {
Ziggurat::Ziggurat::Ziggurat zigg(end);
for (std::size_t i = begin; i < end; i++) {
draws[i] = zigg.norm();
}
}
};
// [[Rcpp::export]]
Rcpp::NumericVector parallelDraws(int x) {
// allocate the output vector
Rcpp::NumericVector draws(Rcpp::no_init(x));
Norm norm(draws);
parallelFor(0, x, norm);
return draws;
}
// [[Rcpp::export]]
Rcpp::NumericVector serialDraws(int x) {
// allocate the output vector
Rcpp::NumericVector draws(Rcpp::no_init(x));
Ziggurat::Ziggurat::Ziggurat zigg(42);
for (int i = 0; i < x; i++) {
draws[i] = zigg.norm();
}
return draws;
}
Note that I am using "poor man's parallel RNG", i.e. distinct seeds for the different threads, and hope for the best. I am using end
as seed, since begin
might be zero and I am not sure if the RNG in RcppZiggurat likes that. Since it takes some time (and memory) to create a Ziggurat
object, I also use a local one for the serial computation to be fair.
For 10^5 random draws, there is still no gain from using parallel computation:
> bench::mark(parallelDraws(1e5), serialDraws(1e5), check = FALSE, min_iterations = 10)[,1:5]
# A tibble: 2 x 5
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
1 parallelDraws(1e+05) 1.08ms 1.78ms 558. 784KB
2 serialDraws(1e+05) 624.16µs 758.6µs 1315. 784KB
But for 10^8 draws I get a nice speed-up on my dual core laptop:
> bench::mark(parallelDraws(1e8), serialDraws(1e8), check = FALSE, min_iterations = 10)[,1:5]
# A tibble: 2 x 5
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
1 parallelDraws(1e+08) 326ms 343ms 2.91 763MB
2 serialDraws(1e+08) 757ms 770ms 1.30 763MB
So whether it makes sense to use parallel computation depends heavily on the number of random draws you need.
BTW, in the comments my dqrng package is mentioned. This package also uses the Ziggurat method for normal (and exponential) draws combined with very fast 64bit RNGs, giving it comparable serial speed to RcppZiggurat for normal draws. In addition, the used RNGs are ment for parallel computation, i.e. there is no need for hoping to get non-overlapping random streams by using different seeds.