I have a very large nested list and I would like to take a random sample from each nested list, but the sample size for each of the lists differs. The sample size is defined in a data frame. I would like to write a function that matches the name of first level list to the corresponding entry in the data frame to then define the required sample size.
A simpler version of my nested list here.
mydf <- data.frame(
name = c("A", "B", "C", "D"),
n = c(5, 18, 25, 7))
df1 <- data.frame(matrix(1:10, nrow = 5, ncol=3))
df2 <- data.frame(matrix(1:10, nrow = 15, ncol=3))
df3 <- data.frame(matrix(1:10, nrow = 20, ncol=3))
df4 <- data.frame(matrix(1:10, nrow = 40, ncol=3))
df5 <- data.frame(matrix(1:10, nrow = 27, ncol=3))
df6 <- data.frame(matrix(1:10, nrow = 78, ncol=3))
list1 <- list(df1 = df1, df2 = df2, df3 = df3)
list2 <- list(df1 = df4, df2 = df5, df3 = df6)
list3 <- list(df1 = df2, df2 = df5, df3 = df4)
list4 <- list(df1 = df6, df2 = df3, df3 = df1)
mylist <- list("A" = list1,
"B" = list2,
"C" = list3,
"D" = list4)
mylist
I have written a work-around where I extract the first level lists one by one and specify the names I'd like to match, but this requires for me to copy my code multiple times. Ideally, I would like to automate this step.
Ultimately, I need to run this step several thousand times so it needs to be efficient.
library(purrr)
set.seed(0)
A <- mylist$A
A2 <- lapply(A, function(df){
N <- pluck( n, 'n', which(mydf$name == 'A'))
rd <- function(x) sample(x, size = N, replace =TRUE)
df <- apply(df, 2, rd)
})
A2
I think the task is easier with less list action and more tables. but I admittedly may have been biased to this by the column consistency between the various frames to be sampled from. Though if they were different sized, probably a version with nesting would work.
library(tidyverse)
(mylongdata <- bind_rows(
map(mylist, \(x){
bind_rows(x, .id = "inner_id")
}), .id = "outer_id"
))
inner_func <- function(name, n) {
filter(
mylongdata,
outer_id == name
) |>
group_by(inner_id) |>
dplyr::slice_sample(
n = n,
replace = TRUE
)
}
sample_results <- map2(
mydf$name,
mydf$n, inner_func
)