Search code examples
rdplyrmodellmpredict

Generating prediction intervals for more than 1 linear model in R?


I am attempting to generate prediction intervals using the function predict() for a new set of data, but across more than one model that I have generated for a dataset. I am relatively inexperienced at using lapply, but figure it should be helpful in this process:

#Calling in my libraries:
library(dplyr)

#Creating dataset:

DNase <- DNase

#Generating models, one for each "Run" in DNAse:
model_dna <- DNase %>% 
  group_by(Run) %>% 
  do(model_dna_group = lm(log(density) ~ log(conc), data = .)) %>%   ungroup()

#Creating a new data set to be used to generate predictions:
new_dna <- as.data.frame(DNase$conc) %>% 
  mutate(conc = DNase$conc * 2) %>% select(conc)

#Attempting to apply predict to these models for a new data frame:
new_dna_w_predictions <- lapply(
                           X = model_dna, 
                           FUN = predict, 
                           newdata = new_dna, 
                           interval = "prediction", 
                           level = 0.9
                          )

However, this draws the following error:

Error in get(as.character(FUN), mode = "function", envir = envir) : object 'model_dna' of mode 'function' was not found

I am not sure how best to structure this lapply function, especially when being used across more than one model. Is there a generally cleaner way to approach this?


Solution

  • Here you have a full tidyverse solution:

    # Calling in my libraries:
    library(dplyr)
    library(purrr)
    
    # Creating dataset:
    DNase <- DNase
    
    # Creating a new data set to be used to generate predictions:
    new_dna <- DNase %>% transmute(conc = conc * 2)  # simplified
    
    # Generating models, one for each "Run" in DNAse:
    model_dna <- DNase %>% 
      group_by(Run) %>% 
      summarise(model_dna_group = list(lm(log(density) ~ log(conc))))
      
    model_dna
    #> # A tibble: 11 x 2
    #>    Run   model_dna_group
    #>    <ord> <list>         
    #>  1 10    <lm>           
    #>  2 11    <lm>           
    #>  3 9     <lm>           
    #>  4 1     <lm>           
    #>  5 4     <lm>           
    #>  6 8     <lm>           
    #>  7 5     <lm>           
    #>  8 7     <lm>           
    #>  9 6     <lm>           
    #> 10 2     <lm>           
    #> 11 3     <lm>
    
    
    # Run predictions
    model_dna %>%
      group_by(Run) %>% 
      summarise(map(model_dna_group, predict, newdata = new_dna, interval = "prediction", level = 0.9) %>% map_dfr(as_tibble),
                .groups = "drop")
    
    #> # A tibble: 1,936 x 4
    #>    Run       fit    lwr    upr
    #>    <ord>   <dbl>  <dbl>  <dbl>
    #>  1 10    -2.16   -2.48  -1.85 
    #>  2 10    -2.16   -2.48  -1.85 
    #>  3 10    -1.33   -1.64  -1.03 
    #>  4 10    -1.33   -1.64  -1.03 
    #>  5 10    -0.918  -1.22  -0.617
    #>  6 10    -0.918  -1.22  -0.617
    #>  7 10    -0.503  -0.804 -0.201
    #>  8 10    -0.503  -0.804 -0.201
    #>  9 10    -0.0873 -0.392  0.217
    #> 10 10    -0.0873 -0.392  0.217
    #> # ... with 1,926 more rows
    

    Created on 2021-11-19 by the reprex package (v2.0.0)

    Notice:

    • after dplyr 1.0 you don't need to use do anymore for this kind of cases
    • with map and map_dfr you can calculate your predictions and fit them nicely in your tibble