Parallel samples from a random normal distribution - not faster?

I am using R to create a simulation that takes samples from a random normal distribution, and not surprisingly, it's fairly slow. So, I looked for some ways to speed it up using Rcpp, and came across the RcppZiggurat package for faster random normal samples, and the RcppParallel package for multi-threaded computation, and thought, why not use both a faster algorithm and draw the samples in parallel?

So I started prototyping, and finally ended up with three methods to compare:

Samples using RcppParallel and RcppZiggurat together
Samples using just RcppZiggurat
Samples using good old rnorm

Below are my implementations using RcppParallel + RcppZiggurat (the parallelDraws function), and just RcppZiggurat (the serialDraws function):

#include <Rcpp.h>
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
// [[Rcpp::depends(RcppZiggurat)]]
#include <Ziggurat.h>

static Ziggurat::Ziggurat::Ziggurat zigg;

using namespace RcppParallel;

struct Norm : public Worker
{   
  int input;

  // saved draws
  RVector<double> draws;

  // constructors
  Norm(const int input, Rcpp::NumericVector draws)
    : input(input), draws(draws) {}

  void operator()(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; i++) {
      draws[i] = zigg.norm();
    }
  }
};

// [[Rcpp::export]]
Rcpp::NumericVector parallelDraws(int x) {

  // allocate the output vector
  Rcpp::NumericVector draws(x);

  // declare the Norm instance 
  Norm norm(x, draws);

  // call parallelFor to start the work
  parallelFor(0, x, norm);

  // return the draws
  return draws;
};

// [[Rcpp::export]]
Rcpp::NumericVector serialDraws(int x) {

  // allocate the output vector
  Rcpp::NumericVector draws(x);

  for (int i = 0; i < x; i++) {
    draws[i] = zigg.norm();
  }

  // return the draws
  return draws;
};

When I benchmarked them, I found some surprising results:

library(microbenchmark)
microbenchmark(parallelDraws(1e5), serialDraws(1e5), rnorm(1e5))

Unit: microseconds
                 expr      min       lq     mean    median       uq        max neval
 parallelDraws(1e+05) 3113.752 3539.686 3687.794 3599.1540 3943.282   5058.376   100
   serialDraws(1e+05)  695.501  734.593 2536.940  757.2325  806.135 175712.496   100
         rnorm(1e+05) 6072.043 6264.030 6655.835 6424.0195 6661.739  18578.669   100

Using RcppZiggurat alone was about 8x faster than rnorm, but using RcppParallel and RcppZiggurat together was slower than using RcppZiggurat alone! I tried playing around with the grain size for the RcppParallel ParallelFor function, but it didn't result in any noticeable improvement.

My question is: What could be the reason why adding parallelism is actually worse here? I know that "overhead" in parallel computations can outweigh the benefits depending on various factors. Is that what is happening here? Or am I completely misunderstanding how to effectively use the RcppParallel package?

Solution

As mentioned in the comments, overhead can be problematic especially when the overall runtime is short, it is better to not initialize the output vectors with zero and to use thread local RNGs. Example implementation:

#include <Rcpp.h>
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
// [[Rcpp::depends(RcppZiggurat)]]
#include <Ziggurat.h>


using namespace RcppParallel;

struct Norm : public Worker
{   
  // saved draws
  RVector<double> draws;
  
  // constructors
  Norm(Rcpp::NumericVector draws)
    : draws(draws) {}
  
  void operator()(std::size_t begin, std::size_t end) {
    Ziggurat::Ziggurat::Ziggurat zigg(end);
    for (std::size_t i = begin; i < end; i++) {
      draws[i] = zigg.norm();
    }
  }
};

// [[Rcpp::export]]
Rcpp::NumericVector parallelDraws(int x) {
  // allocate the output vector
  Rcpp::NumericVector draws(Rcpp::no_init(x));
  Norm norm(draws);
  parallelFor(0, x, norm);
  return draws;
}

// [[Rcpp::export]]
Rcpp::NumericVector serialDraws(int x) {
  // allocate the output vector
  Rcpp::NumericVector draws(Rcpp::no_init(x));
  Ziggurat::Ziggurat::Ziggurat zigg(42);
  for (int i = 0; i < x; i++) {
    draws[i] = zigg.norm();
  }
  return draws;
}

Note that I am using "poor man's parallel RNG", i.e. distinct seeds for the different threads, and hope for the best. I am using end as seed, since begin might be zero and I am not sure if the RNG in RcppZiggurat likes that. Since it takes some time (and memory) to create a Ziggurat object, I also use a local one for the serial computation to be fair.

For 10^5 random draws, there is still no gain from using parallel computation:

> bench::mark(parallelDraws(1e5), serialDraws(1e5), check = FALSE, min_iterations = 10)[,1:5]
# A tibble: 2 x 5
  expression                min   median `itr/sec` mem_alloc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 parallelDraws(1e+05)   1.08ms   1.78ms      558.     784KB
2 serialDraws(1e+05)   624.16µs  758.6µs     1315.     784KB

But for 10^8 draws I get a nice speed-up on my dual core laptop:

> bench::mark(parallelDraws(1e8), serialDraws(1e8), check = FALSE, min_iterations = 10)[,1:5]
# A tibble: 2 x 5
  expression                min   median `itr/sec` mem_alloc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 parallelDraws(1e+08)    326ms    343ms      2.91     763MB
2 serialDraws(1e+08)      757ms    770ms      1.30     763MB

So whether it makes sense to use parallel computation depends heavily on the number of random draws you need.

BTW, in the comments my dqrng package is mentioned. This package also uses the Ziggurat method for normal (and exponential) draws combined with very fast 64bit RNGs, giving it comparable serial speed to RcppZiggurat for normal draws. In addition, the used RNGs are ment for parallel computation, i.e. there is no need for hoping to get non-overlapping random streams by using different seeds.