Tags: r, performance, loops, memory, multidimensional-array

How to optimize memory and speed for repeated survey-simulations in R?


I am attempting to conduct a simulation in which I have 257 sub-districts, each with a list of villages, and each village has a known percentage of people with a certain condition. I want to iterate over five numbers of villages (n_village in {5, 10, 20, 30, 50}) and five numbers of people per village (n_person in {10, 15, 20, 25, 30}), and for each combination I want to repeat a survey 1000 times. That means 5 * 5 * 257 * 1000 = 6,425,000 results. Furthermore, my full list of sites contains 4,999 villages: 3,836 for one country and 1,163 for another.

For each sub-district:

  1. Randomly select n_village villages.
  2. Within each resulting village, randomly select n_person individuals and count how many people have the condition.
  3. Calculate the estimated district-level percentage of people with the condition, along with a confidence interval for that percentage.

My first attempt involved iterating over everything with mapply and saving the randomly sampled villages to a list of results, but that was taking a very long time.

I made a lot of headway using nested tibbles and pmap, but I am now running into memory issues and performance slow-downs.

Below is a 10-sub-district approximation of my approach, restricted to 10 simulations.

library(tidyverse)
library(magrittr)  # for the %<>% assignment pipe used below
library(tictoc)

admin_unit_df <- 
  tibble(
    shapeName_0 = "Country 1", 
    shapeName_1 = "Region 1",
    shapeName_2 = "District 1",
    shapeName_3 = paste0("Sub-district ", 1:10)
  )

sampling_frame_df <- 
  tibble(
    site_id = 1:500,
    shapeName_3 = sample(admin_unit_df$shapeName_3, 500, replace = T),
    site.prev = runif(500)
  ) %>%
  group_by(
    shapeName_3
  ) %>%
  nest()

sim_parms_df <-
  expand.grid(
    "n_village" = c(2, 5, 10, 15, 20),
    "n_person"  = c(10, 20, 30, 40, 50),
    "sim_id"    = 1:10
  ) %>%
  arrange(n_village, n_person, sim_id) %>%
  as_tibble() %>%
  rowwise() 

simulations_df <-
  admin_unit_df %>%
  # join on shapeName_3 so each admin unit gets its own nested site data
  left_join(sampling_frame_df, by = "shapeName_3") %>%
  mutate(
    sim_parms = list(sim_parms_df)
  )

simulations_df

simulations_df$sim_parms[[1]]

I then randomly sample the villages from the data

tic()
simulations_df %<>%
  mutate(
    sim_parms = 
      pmap(
        .l = 
          list(
            "sim_parms" = sim_parms,
            "data" = data
          ),
        .f = 
          function(sim_parms, data){
            n_site_avail = nrow(data)
            
            sim_parms %>%
              mutate(
                samples =
                  list(
                    data[sample(x = 1:n_site_avail, size = min(n_village, n_site_avail), replace = F), ] %>%
                      as_tibble() %>%
                      rowwise()
                  )
              )
          }
    )
  )
toc()

After this, I use pmap to iterate over the sampled villages and randomly draw the number of people positive for the condition in each one.

tic()
simulations_df %<>%
  mutate(
    sim_parms = 
      pmap(
        .l = 
          list(
            "sim_parms" = sim_parms
          ),
        .f = 
          function(sim_parms){
            # cat(names(sim_parms), "\n")
            
            sim_parms %>%
              ungroup() %>%
              mutate(
                samples = 
                  pmap(
                    .l = list("samples" = samples, "n_person" = n_person),
                    .f = function(samples, n_person){
                      # cat(names(samples), "\n")
                      samples %>%
                        mutate(
                          n_pos = rbinom(n = 1, size = n_person, p = site.prev),
                          p_hat = n_pos / n_person
                        )
                    }
                  )
              )
            
          }
    )
  )
toc()

Things start to slow down and memory usage blows up here, but I have been able to complete this step.

The next thing I want to do is use binom.test() to calculate the lower and upper confidence interval bounds for the estimated prevalence.

This is a very basic analysis, yet it takes quite a long time. At some point, I aim to use lme4 to conduct a multilevel model analysis for all 1000 simulations.
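
Roughly, the kind of model I have in mind for a single simulation is sketched below (just a sketch; one_sim_df is a placeholder for one simulation's pooled village-level results):

library(lme4)

# Sketch only: a binomial multilevel model with a random intercept per
# sub-district. `one_sim_df` is a placeholder data frame with one row per
# sampled village and columns n_pos, n_person, and shapeName_3.
fit <- glmer(
  cbind(n_pos, n_person - n_pos) ~ 1 + (1 | shapeName_3),
  data   = one_sim_df,
  family = binomial
)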

Any tips on how to make this task tractable? Is there a way to loop and save results without keeping all of the intermediate values?

As an aside: saving the results to RDS and reading the RDS back in takes a very long time. Any suggestions to speed this up?


Solution

  • You can gain a lot of performance by operating on one big data.frame instead of long lists or nested columns of small data.frames. A single grouped data.frame is usually much faster to operate on.

    Also, try to avoid carrying grouped data around (rowwise() is also a form of grouping) for operations where the grouping is unnecessary. The .by/by argument for dplyr verbs, introduced in dplyr 1.1.0, is excellent because it applies the grouping only for the current operation.
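
    As a small illustration (not part of the rewrite below), both snippets compute the same per-cylinder mean for mtcars, but the .by version does it without ever creating a persistently grouped data.frame:

    # Persistent grouping: the result stays grouped until ungroup()
    mtcars |>
      group_by(cyl) |>
      mutate(mean_mpg = mean(mpg)) |>
      ungroup()

    # Per-operation grouping: the result comes back ungrouped
    mtcars |>
      mutate(mean_mpg = mean(mpg), .by = cyl)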

    Here is a quick rewrite of your code. The idea is to limit the use of map() functions and, when used, bind the results into a data.frame before applying the next operation.

    (I checked the results of my code only briefly, and it seems to produce the same results as yours. Please check closely to make sure that it adheres to your specifications.)

    library(tidyverse)
    library(tictoc)
    
    tic()
    
    set.seed(1)
    admin_unit_df <-
      tibble(
        shapeName_0 = "Country 1",
        shapeName_1 = "Region 1",
        shapeName_2 = "District 1",
        shapeName_3 = paste0("Sub-district ", 1:10)
      )
    
    sampling_frame_df <-
      tibble(
        site_id = 1:500,
        shapeName_3 = sample(admin_unit_df$shapeName_3, 500, replace = T),
        site.prev = runif(500),
      )
    
    sim_parms_df <-
      expand.grid(
        "n_village" = c(2, 5, 10, 15, 20),
        "n_person"  = c(10, 20, 30, 40, 50),
        "sim_id"    = 1:10
      ) |>
      arrange(n_village, n_person, sim_id) |>
      as_tibble()
    
    samples_df <-
      sim_parms_df |>
      mutate(samples =
               # draw n_village sites within each sub-district,
               # once per parameter combination and simulation
               map(n_village,
                   \(villages) slice_sample(sampling_frame_df,
                                            n = villages,
                                            by = shapeName_3))) |>
      unnest(samples) |>
      group_by(shapeName_3, n_village, n_person, sim_id) |>
      mutate(
        n_sites_sampled = n()
      ) |>
      ungroup()
    
    res <-
      samples_df |>
      group_by(shapeName_3, site_id) |>
      mutate(n_pos = rbinom(n = 1, size = n_person, p = site.prev),
             p_hat = n_pos / n_person)
    
    toc()
    #> 0.364 sec elapsed
    
    res
    #> # A tibble: 24,500 × 8
    #> # Groups:   shapeName_3, site_id [500]
    #>    n_village n_person sim_id site_id shapeName_3    site.prev n_pos p_hat
    #>        <dbl>    <dbl>  <int>   <int> <chr>              <dbl> <int> <dbl>
    #>  1         2       10      1     254 Sub-district 9   0.00573     0   0  
    #>  2         2       10      1     155 Sub-district 9   0.971      10   1  
    #>  3         2       10      1      70 Sub-district 4   0.0764      0   0  
    #>  4         2       10      1     451 Sub-district 4   0.0389      0   0  
    #>  5         2       10      1     278 Sub-district 7   0.188       3   0.3
    #>  6         2       10      1     157 Sub-district 7   0.0392      0   0  
    #>  7         2       10      1     258 Sub-district 1   0.167       4   0.4
    #>  8         2       10      1     259 Sub-district 1   0.764       9   0.9
    #>  9         2       10      1       5 Sub-district 2   0.820       9   0.9
    #> 10         2       10      1     496 Sub-district 2   0.305       4   0.4
    #> # ℹ 24,490 more rows
    

    For the example in the question, this code is ~92% faster than the original code. When we scale up the number of sub-districts, the difference gets even bigger: at 100 sub-districts, the original code takes 393.428 seconds on my machine, while the refactored code takes 1.288 seconds, a speed-up of about 99.7%.
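
    The confidence-interval step from the question can stay in the same big-data.frame style. Here is one possible continuation (a sketch only; it pools the sampled villages per sub-district and simulation before calling binom.test(), so check that this matches the estimator you have in mind):

    # Pool the sampled villages per sub-district, parameter combination and
    # simulation, then compute an exact binomial CI for the pooled estimate.
    ci_df <-
      res |>
      ungroup() |>
      summarise(
        n_pos_total = sum(n_pos),
        n_total     = sum(n_person),
        .by = c(shapeName_3, n_village, n_person, sim_id)
      ) |>
      rowwise() |>
      mutate(
        p_hat_pooled = n_pos_total / n_total,
        ci_lo = binom.test(n_pos_total, n_total)$conf.int[1],
        ci_hi = binom.test(n_pos_total, n_total)$conf.int[2]
      ) |>
      ungroup()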