Search code examples
rtidymodels

Bootstrapping using tidymodels from a list of dataframes in R


I am running a model using tidymodels, where split the data by group and run regressions on each individual dataframe. This works well. However, now I also need to bootstrap my results. I'm not sure how to build this into my existing code.

My original code looks something like this:

library(dplyr)

year <- rep(2014:2018, length.out=10000)
group <- sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000)
value <- sample(10000, replace=T)
female <- sample(c(0,1), replace=TRUE, size=10000)
smoker <- sample(c(0,1), replace=TRUE, size=10000)
dta <- data.frame(year=year, group=group, value=value, female=female, smoker=smoker)

# cut the dataset into list
table_list <- dta %>%
  group_by(year, group) %>%
  group_split()

# fit model per subgroup
model_list <- lapply(table_list, function(x) glm(smoker ~ female, data=x,
                                                 family=binomial(link="probit")))
# predict
pred_list <- lapply(model_list, function(x) predict.glm(x, type = "response"))

I would like to bootstrap with replacement to obtain the bootstrapped predicted values. My gut feeling is that I should split the dataset further by creating random samples when I create the table_list. But how exactly do I do that?

Thanks for your help.


Solution

  • This is fairly complex, with the grouping and the bootstrapping, so I would probably approach it like this, using map() two layers deep:

    library(tidyverse)
    library(tidymodels)
    #> Registered S3 method overwritten by 'tune':
    #>   method                   from   
    #>   required_pkgs.model_spec parsnip
    
    year <- rep(2014:2018, length.out=10000)
    group <- sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000)
    value <- sample(10000, replace=T)
    female <- sample(c(0,1), replace=TRUE, size=10000)
    smoker <- sample(c(0,1), replace=TRUE, size=10000)
    dta <- tibble(year=year, group=group, value=value, female=female, smoker=smoker)
    
    
    glm_boot_mods <- 
      dta %>%
      nest(data = c(-year, -group)) %>%
      mutate(boots = map(
        data,  
        ~ bootstraps(., times = 20) %>%
          mutate(model = map(.$splits, ~ glm(smoker ~ female, data = analysis(.x),
                                             family = binomial(link = "probit"))),
                 preds = map2(model, .$splits, ~predict(.x, newdata = assessment(.y))))
        ))
    
    
    glm_boot_mods
    #> # A tibble: 35 × 4
    #>     year group data               boots                
    #>    <int> <dbl> <list>             <list>               
    #>  1  2014     1 <tibble [288 × 3]> <bootstraps [20 × 4]>
    #>  2  2015     4 <tibble [273 × 3]> <bootstraps [20 × 4]>
    #>  3  2016     3 <tibble [301 × 3]> <bootstraps [20 × 4]>
    #>  4  2017     2 <tibble [282 × 3]> <bootstraps [20 × 4]>
    #>  5  2018     0 <tibble [276 × 3]> <bootstraps [20 × 4]>
    #>  6  2014     3 <tibble [279 × 3]> <bootstraps [20 × 4]>
    #>  7  2016     2 <tibble [314 × 3]> <bootstraps [20 × 4]>
    #>  8  2018     1 <tibble [296 × 3]> <bootstraps [20 × 4]>
    #>  9  2014     0 <tibble [304 × 3]> <bootstraps [20 × 4]>
    #> 10  2015     6 <tibble [288 × 3]> <bootstraps [20 × 4]>
    #> # … with 25 more rows
    

    The first map() creates the bootstrap resamples for each grouping, and then we go one layer deeper and for each resample fit a model and predict for the heldout observations for that resample. You can see what that looks like inside here for the first group:

    glm_boot_mods %>%
      head(1) %>% 
      pull(boots)
    #> [[1]]
    #> # Bootstrap sampling 
    #> # A tibble: 20 × 4
    #>    splits            id          model  preds      
    #>    <list>            <chr>       <list> <list>     
    #>  1 <split [288/111]> Bootstrap01 <glm>  <dbl [111]>
    #>  2 <split [288/93]>  Bootstrap02 <glm>  <dbl [93]> 
    #>  3 <split [288/103]> Bootstrap03 <glm>  <dbl [103]>
    #>  4 <split [288/106]> Bootstrap04 <glm>  <dbl [106]>
    #>  5 <split [288/109]> Bootstrap05 <glm>  <dbl [109]>
    #>  6 <split [288/109]> Bootstrap06 <glm>  <dbl [109]>
    #>  7 <split [288/92]>  Bootstrap07 <glm>  <dbl [92]> 
    #>  8 <split [288/111]> Bootstrap08 <glm>  <dbl [111]>
    #>  9 <split [288/99]>  Bootstrap09 <glm>  <dbl [99]> 
    #> 10 <split [288/111]> Bootstrap10 <glm>  <dbl [111]>
    #> 11 <split [288/102]> Bootstrap11 <glm>  <dbl [102]>
    #> 12 <split [288/104]> Bootstrap12 <glm>  <dbl [104]>
    #> 13 <split [288/115]> Bootstrap13 <glm>  <dbl [115]>
    #> 14 <split [288/111]> Bootstrap14 <glm>  <dbl [111]>
    #> 15 <split [288/108]> Bootstrap15 <glm>  <dbl [108]>
    #> 16 <split [288/110]> Bootstrap16 <glm>  <dbl [110]>
    #> 17 <split [288/110]> Bootstrap17 <glm>  <dbl [110]>
    #> 18 <split [288/111]> Bootstrap18 <glm>  <dbl [111]>
    #> 19 <split [288/103]> Bootstrap19 <glm>  <dbl [103]>
    #> 20 <split [288/109]> Bootstrap20 <glm>  <dbl [109]>
    

    Created on 2021-11-02 by the reprex package (v2.0.1)

    Notice that there are predictions for the heldout observations for each resample. Depending on what you want to do, you can use unnest() on the columns of glm_boot_mods you need to handle next.