I am attempting to conduct a simulation in which I have 257 sub-distrct and a list of villages within each sub-district, each with a known percentage of people that have a certain condition. I want to iterate over five numbers of villages (n_village
in {5, 10, 20 ,30 ,50}) and five numbers of people per village (n_person
in {10, 15, 20, 25, 30}) and for each I want to repeat a survey 1000 times. That means 5 * 5 * 257 * 1000 = 6,425,000 results. Furthermore, my list of sites is 4,999, 3836 for one country, and 1163 for another.
For all sub-district:
n_village
villages.n_person
individuals and count how many people have the condition.My first attempt involved iterating over everything with mapply and saving the results randomly sampled villages to a list, but that was taking very, very long.
I made a lot of headway using nested tibbles
and pmap
, but I am now running into memory issues and performance slow-downs.
Below is a 10 sub-district approximation of my approach. I am also restricting the example to 10 simulations.
library(tidyverse)
library(tictoc)
admin_unit_df <-
tibble(
shapeName_0 = "Country 1",
shapeName_1 = "Region 1",
shapeName_2 = "District 1",
shapeName_3 = paste0("Sub-district ", 1:10)
)
sampling_frame_df <-
tibble(
site_id = 1:500,
shapeName_3 = sample(admin_unit_df$shapeName_3, 500, replace = T),
site.prev = runif(500)
) %>%
group_by(
shapeName_3
) %>%
nest()
sim_parms_df <-
expand.grid(
"n_village" = c(2, 5, 10, 15, 20),
"n_person" = c(10, 20, 30, 40, 50),
"sim_id" = 1:10
) %>%
arrange(n_village, n_person, sim_id) %>%
as_tibble() %>%
rowwise()
simulations_df <-
admin_unit_df %>%
mutate(
data = sampling_frame_df$data,
sim_parms = list(sim_parms_df)
)
simulations_df
simulations_df$sim_parms[[1]]
I then randomly sample the villages from the data
tic()
simulations_df %<>%
mutate(
sim_parms =
pmap(
.l =
list(
"sim_parms" = sim_parms,
"data" = data
),
.f =
function(sim_parms, data){
n_site_avail = nrow(data)
sim_parms %>%
mutate(
samples =
list(
data[sample(x = 1:n_site_avail, size = min(n_village, n_site_avail), replace = F), ] %>%
as_tibble() %>%
rowwise()
)
)
}
)
)
toc()
After this, I use pmap to iterate and randomly sample the number of people positive for the condition at each village.
tic()
simulations_df %<>%
mutate(
sim_parms =
pmap(
.l =
list(
"sim_parms" = sim_parms
),
.f =
function(sim_parms){
# cat(names(sim_parms), "\n")
sim_parms %>%
ungroup() %>%
mutate(
samples =
pmap(
.l = list("samples" = samples, "n_person" = n_person),
.f = function(samples, n_person){
# cat(names(samples), "\n")
samples %>%
mutate(
n_pos = rbinom(n = 1, size = n_person, p = site.prev),
p_hat = n_pos / n_person
)
}
)
)
}
)
)
toc()
Things start to slow down and blow up here, but I've successfully completed this.
The next thing I want to do is use binom.test()
to calculate the lower and upper confidence interval bounds for the estimated prevalence.
This is a very basic analysis, yet it takes quite a long time. At some point, I aim to use lme4 to conduct a multilevel model analysis for all 1000 simulations.
Any tips on how to make this task tractable? Is there a way to loop and save results without saving all of the intermediary values?
As an aside -- saving to RDS and reading the RDS takes very, very long. Any suggestions to speed this up?
You can gain a lot of performance by operating on a big data.frame
instead of long list
s/nested columns with small data.frame
entries. A grouped data.frame
is usually way faster to operate on.
Also, try to avoid carrying grouped data (rowwise()
is also a grouping) around for operations where the grouping is unnecessary. The new .by/by
argument for dplyr
verbs is excellent since it allows you to apply the grouping just for the current operation.
Here is a quick rewrite of your code. The idea is to limit the use of map()
functions and, when used, bind the results into a data.frame
before applying the next operation.
(I checked the results of my code only briefly, and it seems to produce the same results as yours. Please check closely to make sure that it adheres to your specifications.)
library(tidyverse)
library(tictoc)
tic()
set.seed(1)
admin_unit_df <-
tibble(
shapeName_0 = "Country 1",
shapeName_1 = "Region 1",
shapeName_2 = "District 1",
shapeName_3 = paste0("Sub-district ", 1:10)
)
sampling_frame_df <-
tibble(
site_id = 1:500,
shapeName_3 = sample(admin_unit_df$shapeName_3, 500, replace = T),
site.prev = runif(500),
)
sim_parms_df <-
expand.grid(
"n_village" = c(2, 5, 10, 15, 20),
"n_person" = c(10, 20, 30, 40, 50),
"sim_id" = 1:10
) |>
arrange(n_village, n_person, sim_id) |>
as_tibble()
samples_df <-
sim_parms_df |>
mutate(samples =
pmap(list(n_village,
n_person),
\(villages, persons) {
df =
slice_sample(sampling_frame_df,
n = villages,
by = shapeName_3)
})) |>
unnest(samples) |>
group_by(shapeName_3, n_village, n_person, sim_id) |>
mutate(
n_sites_sampled = n()
) |>
ungroup()
res <-
samples_df |>
group_by(shapeName_3, site_id) |>
mutate(n_pos = rbinom(n = 1, size = n_person, p = site.prev),
p_hat = n_pos / n_person)
toc()
#> 0.364 sec elapsed
res
#> # A tibble: 24,500 × 8
#> # Groups: shapeName_3, site_id [500]
#> n_village n_person sim_id site_id shapeName_3 site.prev n_pos p_hat
#> <dbl> <dbl> <int> <int> <chr> <dbl> <int> <dbl>
#> 1 2 10 1 254 Sub-district 9 0.00573 0 0
#> 2 2 10 1 155 Sub-district 9 0.971 10 1
#> 3 2 10 1 70 Sub-district 4 0.0764 0 0
#> 4 2 10 1 451 Sub-district 4 0.0389 0 0
#> 5 2 10 1 278 Sub-district 7 0.188 3 0.3
#> 6 2 10 1 157 Sub-district 7 0.0392 0 0
#> 7 2 10 1 258 Sub-district 1 0.167 4 0.4
#> 8 2 10 1 259 Sub-district 1 0.764 9 0.9
#> 9 2 10 1 5 Sub-district 2 0.820 9 0.9
#> 10 2 10 1 496 Sub-district 2 0.305 4 0.4
#> # ℹ 24,490 more rows
For the example in the question, this code is ~92% faster than the original code. When we scale up the number of simulations, the difference gets even bigger: At 100 sub-districts, the original code takes 393.428 seconds (on my machine). The refactored code takes 1.288 seconds. That's a speed-up of 99.6%.