r, load-balancing, random-seed, reproducible-research, mclapply

R: inconsistent random number generation in parallel simulation with mclapply


Problem

I'm trying to implement a reproducible multicore simulation, but I obtain inconsistent results. Please help me understand these results and advise me on a correct way of implementing this. Note that I'm working on WSL2 (though I hope my results have another cause).

Details

Each parallel task generates random numbers and is dynamically assigned to an available core (as opposed to being prescheduled to cores in advance). According to the documentation, this can be achieved by using mc.preschedule=FALSE in parallel::mclapply. To guarantee reproducibility when a random seed is set, tasks should generate the same random numbers regardless of which core they are assigned to.

Attempted Solution

My idea is to assign a separate (independent) random number stream to each task using RNGkind("L'Ecuyer-CMRG") and parallel::nextRNGStream. The following snippet generates a list of seeds associated with these streams with one list entry for each task.

library(parallel)
RNGkind("L'Ecuyer-CMRG")
n <- 100  # number of tasks

set.seed(1)
seeds <- list(.Random.seed)
for (i in 2:n) {
  seeds[[i]] <- nextRNGStream(seeds[[i - 1]])
}

The idea is now that each task sets its seed before it starts generating random numbers. I use a function f to represent some task.

f <- function(i, seeds) {
  .Random.seed <- seeds[[i]]
  rnorm(1)
}

Inconsistent Results

I would expect the results of the tasks to be independent of the parameter mc.set.seed in parallel::mclapply, since the tasks set their own seeds anyway. This is not the case, however, as can be observed here:

cores <- 2  # set to more than one
r1 <- mclapply(1:n, f, seeds=seeds, mc.preschedule=FALSE, mc.cores=cores, mc.set.seed=TRUE)
r2 <- mclapply(1:n, f, seeds=seeds, mc.preschedule=FALSE, mc.cores=cores, mc.set.seed=FALSE)
cat("r1: ", sum(unlist(r1)), "\n")
cat("r2: ", sum(unlist(r2)), "\n")
# r1:  24.39407 
# r2:  46.08108

Moreover, I would expect the tasks to generate the same random numbers whether they are executed serially or in parallel. This is not the case either:

r3 <- mclapply(1:n, f, seeds=seeds, mc.preschedule=FALSE, mc.set.seed=FALSE, mc.cores=1)
cat("r3: ", sum(unlist(r3)), "\n")
# r3:  -7.079515

Why do these results occur and what is the correct way of implementing this?


Solution

  • You are setting .Random.seed <- seeds[[i]] in a function. That sets a local variable, not the global random seed. Use .Random.seed <<- seeds[[i]] instead, and it should work.

    The "super-assignment" operator <<- looks through parent environments until it finds an existing variable matching the name, and does the assignment there. If it doesn't find one, it makes the assignment in the global environment. This means in the normal case it will fix your issue, but it's possible you accidentally have another variable named .Random.seed that will be found first, in which case it won't work. So don't do that.
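As a minimal sketch of the fix, the task function below writes the seed into the global environment explicitly. Using assign() with envir = globalenv() is an alternative I'm swapping in for <<- (it is not what the answer literally wrote); it should have the same effect while avoiding the shadowing caveat described above. The names f_fixed, r_set, r_noset and r_serial are hypothetical, and the example assumes a platform where forking works (Linux, macOS, WSL2).

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")
n <- 100  # number of tasks

# One independent L'Ecuyer stream per task, as in the question.
set.seed(1)
seeds <- vector("list", n)
seeds[[1]] <- .Random.seed
for (i in 2:n) {
  seeds[[i]] <- nextRNGStream(seeds[[i - 1]])
}

# Hypothetical corrected task: set the *global* .Random.seed,
# which is the one rnorm() actually reads.
f_fixed <- function(i, seeds) {
  assign(".Random.seed", seeds[[i]], envir = globalenv())
  rnorm(1)
}

cores <- 2  # forking with more than one core is not available on native Windows

r_set <- mclapply(1:n, f_fixed, seeds = seeds, mc.preschedule = FALSE,
                  mc.cores = cores, mc.set.seed = TRUE)
r_noset <- mclapply(1:n, f_fixed, seeds = seeds, mc.preschedule = FALSE,
                    mc.cores = cores, mc.set.seed = FALSE)
r_serial <- lapply(1:n, f_fixed, seeds = seeds)

# All three runs should now agree, whatever mc.set.seed is
# and whether the tasks run in parallel or serially.
print(identical(unlist(r_set), unlist(r_noset)))
print(identical(unlist(r_set), unlist(r_serial)))
```

Because each task overwrites the global seed before drawing, the value it returns depends only on its index i and not on which worker happens to pick it up.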