Search code examples
rlapplypurrrnested-lists

Specify value from data frame when list name matches


I have a very large nested list and I would like to take a random sample from each nested list, but the sample size for each of the lists differs. The sample size is defined in a data frame. I would like to write a function that matches the name of first level list to the corresponding entry in the data frame to then define the required sample size.

A simpler version of my nested list here.

mydf <- data.frame(
  name = c("A", "B", "C", "D"),
  n = c(5, 18, 25, 7))

df1 <- data.frame(matrix(1:10, nrow = 5, ncol=3))
df2 <- data.frame(matrix(1:10, nrow = 15, ncol=3))
df3 <- data.frame(matrix(1:10, nrow = 20, ncol=3))
df4 <- data.frame(matrix(1:10, nrow = 40, ncol=3))
df5 <- data.frame(matrix(1:10, nrow = 27, ncol=3))
df6 <- data.frame(matrix(1:10, nrow = 78, ncol=3))

list1 <- list(df1 = df1, df2 = df2, df3 = df3)
list2 <- list(df1 = df4, df2 = df5, df3 = df6)
list3 <- list(df1 = df2, df2 = df5, df3 = df4)
list4 <- list(df1 = df6, df2 = df3, df3 = df1)

mylist <- list("A" = list1, 
               "B" = list2,
               "C" = list3,
               "D" = list4)


mylist

I have written a work-around where I extract the first level lists one by one and specify the names I'd like to match, but this requires for me to copy my code multiple times. Ideally, I would like to automate this step.

Ultimately, I need to run this step several thousand times so it needs to be efficient.

library(purrr)
set.seed(0)

A <- mylist$A
A2 <- lapply(A, function(df){
  N <- pluck( n, 'n', which(mydf$name == 'A')) 
  rd <- function(x) sample(x, size = N, replace =TRUE)
  df <- apply(df, 2, rd)
  })

A2

Solution

  • I think the task is easier with less list action and more tables. but I admittedly may have been biased to this by the column consistency between the various frames to be sampled from. Though if they were different sized, probably a version with nesting would work.

    
    library(tidyverse)
    
    (mylongdata <- bind_rows(
      map(mylist, \(x){
        bind_rows(x, .id = "inner_id")
      }), .id = "outer_id"
    ))
    
    inner_func <- function(name, n) {
      filter(
        mylongdata,
        outer_id == name
      ) |>
        group_by(inner_id) |>
        dplyr::slice_sample(
          n = n,
          replace = TRUE
        )
    }
    
    sample_results <- map2(
      mydf$name,
      mydf$n, inner_func
    )