Tags: r, vectorization, simulation

Simulate multiple datasets in R, and change seed for each simulation


I want to generate 1 million simulated datasets and save them in a list column.

The problem I run into is that every generated dataset is identical, which I assume is a result of the operation being vectorized. Is there a way to maintain the speed of a vectorized operation while changing the seed for each execution of rbinom()?

library(tidyverse)

SIMS <- 100
df <- 
    tibble(
        sim = 1:5,
        data = 
            list(
                tibble(
                    read = 1:100,
                    events = rbinom(n=SIMS, size=1e3, prob = .25),
                    trials = 1e3
                )
        )
    )

df

(screenshot of df: a five-row tibble with a data list column of tibbles)

head(df$data[[1]])
head(df$data[[2]])

(screenshots: both calls print exactly the same rows, so the five "simulated" datasets are copies of one another)
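In fact, the cause is recycling rather than vectorization itself: list(tibble(...)) is evaluated once, producing a single tibble, and that length-one list is recycled across the five rows of sim, so every element of data is the same object. A quick check confirms it:

identical(df$data[[1]], df$data[[2]])
#> TRUE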

I have tried parallel::mclapply(), which solves the seed issue, but it is nowhere near as fast as the vectorized operation.

BASERATE <- .25
DAILY_N <- 1e3
DURATION <- 500

# mclapply() passes each element of X to FUN as its first argument,
# so the function needs a parameter (it can simply ignore it)
time_to_insight <- function(sim){
  tibble(
    read = 1:DURATION,
    events = rbinom(n=DURATION, size=DAILY_N, prob = BASERATE),
    trials = DAILY_N
  )
}

temp <- 
  parallel::mclapply(
    X = 1:1000,
    FUN = time_to_insight,
    mc.cores = 8L
  )
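
For completeness: if you stick with mclapply(), the parallel package's documented way to get reproducible draws is the L'Ecuyer-CMRG generator, which hands each worker its own seed-controlled stream. A sketch, reusing time_to_insight() from above:

RNGkind("L'Ecuyer-CMRG")  # stream-capable RNG, required for reproducible forks
set.seed(123)             # one seed now controls all worker streams
temp <- 
  parallel::mclapply(
    X = 1:1000,
    FUN = time_to_insight,
    mc.cores = 8L,
    mc.set.seed = TRUE    # the default: give each process a distinct stream
  )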

Solution

  • You can (almost) maintain the speed of your original function by generating the realisations of your random variable (events) for all of your simulations at the same time. This is possible because your event probability remains constant for all the simulations.

    Here's how I would do it. First, write a function that generates every simulation in a single call.

    generateTibble <- function() {
      tibble() %>% 
        expand(sim=1:5, read=1:100, trials=1000) %>% 
        mutate(events=rbinom(n=nrow(.), size=1000, prob=0.25))
    }
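
    As an aside, tidyr::crossing() should produce the same 500-row grid without the empty-tibble trick; a sketch of an equivalent definition:

    generateTibble <- function() {
      crossing(sim=1:5, read=1:100, trials=1000) %>% 
        mutate(events=rbinom(n=nrow(.), size=1000, prob=0.25))
    }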
    

    And use it:

    df <- generateTibble()
    df
    # A tibble: 500 × 4
         sim  read trials events
       <int> <int>  <dbl>  <int>
     1     1     1   1000    246
     2     1     2   1000    232
     3     1     3   1000    258
     4     1     4   1000    233
     5     1     5   1000    259
     6     1     6   1000    251
     7     1     7   1000    254
     8     1     8   1000    263
     9     1     9   1000    267
    10     1    10   1000    266
    # … with 490 more rows
    

    Demonstrate that realisations do not repeat across sims:

    df %>% group_by(sim) %>% group_map(head, n=2)
    [[1]]
    # A tibble: 2 × 3
       read trials events
      <int>  <dbl>  <int>
    1     1   1000    246
    2     2   1000    232
    
    [[2]]
    # A tibble: 2 × 3
       read trials events
      <int>  <dbl>  <int>
    1     1   1000    255
    2     2   1000    261
    
    [[3]]
    # A tibble: 2 × 3
       read trials events
      <int>  <dbl>  <int>
    1     1   1000    234
    2     2   1000    239
    
    [[4]]
    # A tibble: 2 × 3
       read trials events
      <int>  <dbl>  <int>
    1     1   1000    252
    2     2   1000    258
    
    [[5]]
    # A tibble: 2 × 3
       read trials events
      <int>  <dbl>  <int>
    1     1   1000    255
    2     2   1000    258
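
    If you want the question's one-row-per-simulation layout back (a data list column), you can nest the grid afterwards. A sketch, assuming the nest() syntax of tidyr 1.0 or later:

    df_nested <- generateTibble() %>% 
      nest(data = c(read, trials, events))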
    

    Compare the speed of the new function with the original (faulty) version:

    SIMS <- 100
    
    originalFunction <- function() {
      tibble(
        sim = 1:5,
        data = 
          list(
            tibble(
              read = 1:100,
              events = rbinom(n=SIMS, size=1e3, prob = .25),
              trials = 1e3
            )
          )
      )
    }
    
    library(microbenchmark)
    
    microbenchmark(originalFunction(), generateTibble(), times=100)
    Unit: nanoseconds
                 expr min lq  mean median uq  max neval
     originalFunction  42 43 52.95     44 44  877   100
       generateTibble  47 48 68.14     48 54 1403   100
    

    So, a 28% increase in mean execution time. Is that reasonable?
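
    Finally, because every realisation now comes from a single rbinom() call, one set.seed() up front makes the whole simulation reproducible; there is no need to manage a seed per simulation:

    set.seed(42)  # any fixed seed: a single call now governs every dataset
    df <- generateTibble()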