Tags: r, parallel-processing, purrr, r-future, furrr

Ensure reproducibility across `purrr::map()` and `furrr::future_map()`


I am running simulations where some computing should be parallelized and some should not.

I am trying to figure out how to ensure reproducibility across purrr::map() and furrr::future_map() so that they yield the same result.

For some reason, I cannot use set.seed() inside the mapped function.

For instance, consider the following code:

library(purrr)
library(furrr)
#> Loading required package: future


set.seed(42)
rnorm(1)
#> [1] 1.370958

set.seed(42)
map(1, ~rnorm(1))
#> [[1]]
#> [1] 1.370958

set.seed(42)
future_map(1, ~rnorm(1), .options=furrr_options(seed=TRUE))
#> [[1]]
#> [1] -0.1691382

set.seed(42)
future_map(1, ~rnorm(1), .options=furrr_options(seed=42))
#> [[1]]
#> [1] -0.02648871

future_map(1, ~rnorm(1), .options=furrr_options(seed=list(42L)))
#> Error in `validate_seed_list()`:
#> ! All pre-generated random seed elements of a list `seed` must be valid `.Random.seed` seeds, which means they should be all integers and consists of two or more elements, not just one.

Created on 2023-02-21 with reprex v2.0.2

As you can see, I could not get the 1.37 value using furrr. Each call is reproducible on its own, but the calls do not agree with one another.

In my real code, each function will run 100-200 times, which is less than length(.Random.seed) (==626).

I thus thought setting the seed as a list could be a solution, but I don't really understand the documentation or the error message.

For reference, here is the help file that addresses random seed management: link

Is there a way to have purrr::map() and furrr::future_map() yield the same result?

EDIT: for reference, here is the related GitHub issue.


Solution

  • Author of futureverse here.

    1. R uses RNGkind("Mersenne-Twister") by default. This type of random number generator (RNG) is valid only in sequential processing.

    2. For parallel processing, we have to use an RNG that is designed for parallel processing. If not, we will not get statistically sound random numbers and our results risk being biased. This is true for all parallel frameworks. R provides RNGkind("L'Ecuyer-CMRG") for parallel processing. Most parallel solutions that handle RNG at all rely on it (some do not address parallel RNG). There are alternative parallel RNG methods available in different CRAN packages.
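To make the two RNG kinds in (1) and (2) concrete, here is a minimal base-R sketch that inspects the state each one keeps. The variable names (`mt_state`, `stream1`, `stream2`) are illustrative only; `parallel::nextRNGStream()` is the base-R helper for jumping to the next L'Ecuyer-CMRG stream.

```r
# Remember the current RNG kinds so they can be restored afterwards.
old_kind <- RNGkind()

# Default generator: its state is a long integer vector.
RNGkind("Mersenne-Twister")
set.seed(42)
mt_state <- .Random.seed
length(mt_state)   # 626

# Parallel-safe generator: its state is a short integer vector.
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
stream1 <- .Random.seed
length(stream1)    # 7

# Jump to the next independent stream, e.g. for a second worker.
stream2 <- parallel::nextRNGStream(stream1)

# Restore the RNG kinds that were active before (the RNG is re-seeded).
do.call(RNGkind, as.list(old_kind))
```

The 626 here is the same number as `length(.Random.seed)` mentioned in the question; it is the size of one Mersenne-Twister state, not a pool of 626 separate seeds.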

    Because of (1) and (2), it is impossible to (a) reproduce random numbers produced in standard sequential processing in R, when (b) running in parallel. The only way to achieve it is to change the sequential processing to also use parallel RNG (e.g. RNGkind("L'Ecuyer-CMRG")). Unfortunately, it's not just a matter of changing the RNG-kind settings. One also has to update the implementation of the underlying algorithm (here purrr). In contrast, the futureverse does this at the core (and makes it part of the design requirements). So,
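As a rough illustration of what "updating the implementation" could involve, the sketch below gives every element its own pre-generated L'Ecuyer-CMRG stream, so the result no longer depends on the order in which elements are processed. `map_streams()` and its `order` argument are hypothetical helpers written for this example; this is not how purrr works and not necessarily the exact algorithm furrr uses.

```r
# Hypothetical helper: apply f to each element of x, giving each element
# its own L'Ecuyer-CMRG stream derived from `seed`.
# `order` only controls the processing order, to mimic workers finishing
# in a different sequence.
map_streams <- function(x, f, seed, order = seq_along(x)) {
  # Switch to the parallel-safe RNG and restore the previous kinds on exit.
  # Note that the global RNG state is modified by this function.
  old_kind <- RNGkind("L'Ecuyer-CMRG")
  on.exit(do.call(RNGkind, as.list(old_kind)), add = TRUE)

  # Pre-generate one independent stream per element.
  set.seed(seed)
  s <- .Random.seed
  streams <- vector("list", length(x))
  for (i in seq_along(x)) {
    streams[[i]] <- s
    s <- parallel::nextRNGStream(s)
  }

  # Process the elements, installing each element's stream before calling f.
  out <- vector("list", length(x))
  for (i in order) {
    assign(".Random.seed", streams[[i]], envir = globalenv())
    out[[i]] <- f(x[[i]])
  }
  out
}

forward  <- map_streams(1:5, function(i) rnorm(1), seed = 42)
backward <- map_streams(1:5, function(i) rnorm(1), seed = 42, order = 5:1)
identical(forward, backward)   # TRUE
```

Because each draw depends only on the stream assigned to its element, splitting the work across workers in any way would give the same values. A plain `set.seed(42)` followed by `map()` cannot offer this, since every call consumes from one shared Mersenne-Twister sequence.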

    3. The future framework uses RNGkind("L'Ecuyer-CMRG") everywhere, regardless of parallel backend ("plan") and number of parallel workers.

    Thus, in your case using furrr, you will get the exact same random numbers when you use plan(sequential) (default), plan(multicore), plan(multisession), plan(future.callr::callr), plan(future.batchtools::batchtools_slurm), etc.

    So, in summary:

    You have to accept that:

    library(purrr)
    set.seed(42)
    map(1:10, ~rnorm(1))
    

    and

    library(furrr)
    set.seed(42)
    future_map(1:10, ~rnorm(1), .options=furrr_options(seed = TRUE))
    

    will produce different sequences of random numbers, but both are still statistically sound. Once you accept that, it is nice to know that regardless of which plan() you set, you'll get identical random-number sequences, e.g.

    plan(sequential)
    set.seed(42)
    future_map(1:10, ~rnorm(1), .options=furrr_options(seed = TRUE))
    

    gives the same results as:

    plan(multisession)
    set.seed(42)
    future_map(1:10, ~rnorm(1), .options=furrr_options(seed = TRUE))