Search code examples
rtidyversepurrrlmstargazer

Estimating multiple `lm` models and returning output in one table, with map()


I need to estimate a number of linear models on the same dataset, and put the regression results all into one table. For a reproducible example, here's a simplification using mtcars:

formula_1 = "mpg ~ disp"
formula_2 = "mpg ~ log(disp)"
formula_3 = "mpg ~ disp + hp" 

Currently, my approach has been to:

  1. Create a list that contains all of the formulae.
  2. use purrr:map() to estimate all of the lm models.
  3. use stargazer:: to produce output tables.
library(tidyverse)
library(stargazer)

formula_1 = "mpg ~ disp"
formula_2 = "mpg ~ log(disp)"
formula_3 = "mpg ~ disp + hp"

lst <- list(formula_1, formula_2, formula_3)
  
models<- lst %>% map(~lm(., mtcars))
stargazer(models, type = "text")

Which gives me the output I'm looking for:

#> 
#> =========================================================================================
#>                                              Dependent variable:                         
#>                     ---------------------------------------------------------------------
#>                                                      mpg                                 
#>                              (1)                     (2)                    (3)          
#> -----------------------------------------------------------------------------------------
#> disp                      -0.041***                                      -0.030***       
#>                            (0.005)                                        (0.007)        
#>                                                                                          
#> log(disp)                                         -9.293***                              
#>                                                    (0.787)                               
#>                                                                                          
#> hp                                                                        -0.025*        
#>                                                                           (0.013)        
#>                                                                                          
#> Constant                  29.600***               69.205***              30.736***       
#>                            (1.230)                 (4.185)                (1.332)        
#>                                                                                          
#> -----------------------------------------------------------------------------------------
#> Observations                  32                     32                      32          
#> R2                          0.718                   0.823                  0.748         
#> Adjusted R2                 0.709                   0.817                  0.731         
#> Residual Std. Error    3.251 (df = 30)         2.579 (df = 30)        3.127 (df = 29)    
#> F Statistic         76.513*** (df = 1; 30) 139.350*** (df = 1; 30) 43.095*** (df = 2; 29)
#> =========================================================================================
#> Note:                                                         *p<0.1; **p<0.05; ***p<0.01

First Question:

How can I put all of the formulae into a list when there are many formula? The line below works if there are only 3 formulae, but seems clumsy when there are many models to be estimated.

lst <- list(formula_1, formula_2, formula_3)

Follow up Question:

Is there a better way to accomplish the entire task, using say broom or another method? Or is purrr:map() a reasonable solution?


Solution

  • Here is a workflow I would suggest. We can use nested tibbles to structure our data and use broom to get tidy estimates and fitted values:

    library(tidyverse)
    library(broom)
    
    # Created nested tibble
    nested_df <- tibble(formula = c("mpg ~ disp", "mpg ~ log(disp)", "mpg ~ disp + hp")) %>%
      group_by(ID = formula) %>%
      group_modify(~ as_tibble(mtcars)) %>%
      nest() 
    
    # Get model estimates
    nested_df %>%
      mutate(estimates = data %>% map2(ID, ~ tidy(lm(.y, data = .x)))) %>%
      select(-data) %>%
      unnest()
    
    # Get fitted values and residuals
    nested_df %>%
      mutate(model = ID %>% map2(data, lm),
             stats = model %>% map(augment)) %>%
      select(-data, -model) %>%
      unnest() 
    

    Output:

    > nested_df
    # A tibble: 3 x 2
      ID              data              
      <chr>           <list>            
    1 mpg ~ disp      <tibble [32 x 11]>
    2 mpg ~ disp + hp <tibble [32 x 11]>
    3 mpg ~ log(disp) <tibble [32 x 11]>
    
    # A tibble: 7 x 6
      ID              term        estimate std.error statistic  p.value
      <chr>           <chr>          <dbl>     <dbl>     <dbl>    <dbl>
    1 mpg ~ disp      (Intercept)  29.6      1.23        24.1  3.58e-21
    2 mpg ~ disp      disp         -0.0412   0.00471     -8.75 9.38e-10
    3 mpg ~ disp + hp (Intercept)  30.7      1.33        23.1  3.26e-20
    4 mpg ~ disp + hp disp         -0.0303   0.00740     -4.10 3.06e- 4
    5 mpg ~ disp + hp hp           -0.0248   0.0134      -1.86 7.37e- 2
    6 mpg ~ log(disp) (Intercept)  69.2      4.19        16.5  1.28e-16
    7 mpg ~ log(disp) log(disp)    -9.29     0.787      -11.8  8.40e-13
    
    # A tibble: 96 x 12
       ID           mpg  disp .fitted .se.fit .resid   .hat .sigma  .cooksd .std.resid    hp log.disp.
       <chr>      <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl> <dbl>     <dbl>
     1 mpg ~ disp  21    160     23.0   0.664 -2.01  0.0418   3.29 0.00865      -0.630    NA        NA
     2 mpg ~ disp  21    160     23.0   0.664 -2.01  0.0418   3.29 0.00865      -0.630    NA        NA
     3 mpg ~ disp  22.8  108     25.1   0.815 -2.35  0.0629   3.28 0.0187       -0.746    NA        NA
     4 mpg ~ disp  21.4  258     19.0   0.589  2.43  0.0328   3.27 0.00983       0.761    NA        NA
     5 mpg ~ disp  18.7  360     14.8   0.838  3.94  0.0663   3.22 0.0558        1.25     NA        NA
     6 mpg ~ disp  18.1  225     20.3   0.575 -2.23  0.0313   3.28 0.00782      -0.696    NA        NA
     7 mpg ~ disp  14.3  360     14.8   0.838 -0.462 0.0663   3.31 0.000770     -0.147    NA        NA
     8 mpg ~ disp  24.4  147.    23.6   0.698  0.846 0.0461   3.30 0.00172       0.267    NA        NA
     9 mpg ~ disp  22.8  141.    23.8   0.714 -0.997 0.0482   3.30 0.00250      -0.314    NA        NA
    10 mpg ~ disp  19.2  168.    22.7   0.647 -3.49  0.0396   3.24 0.0248       -1.10     NA        NA
    # ... with 86 more rows
    

    If you prefer a stargazer table, we can also pull the model list-column out:

    library(stargazer)
    
    nested_df %>%
      mutate(model = ID %>% map2(data, ~ lm(.x, .y))) %>%
      pull(model) %>%
      stargazer(type = "text")
    

    Output:

    =========================================================================================
                                                 Dependent variable:                         
                        ---------------------------------------------------------------------
                                                         mpg                                 
                                 (1)                    (2)                     (3)          
    -----------------------------------------------------------------------------------------
    disp                      -0.041***              -0.030***                               
                               (0.005)                (0.007)                                
    
    hp                                                -0.025*                                
                                                      (0.013)                                
    
    log(disp)                                                                -9.293***       
                                                                              (0.787)        
    
    Constant                  29.600***              30.736***               69.205***       
                               (1.230)                (1.332)                 (4.185)        
    
    -----------------------------------------------------------------------------------------
    Observations                  32                     32                     32           
    R2                          0.718                  0.748                   0.823         
    Adjusted R2                 0.709                  0.731                   0.817         
    Residual Std. Error    3.251 (df = 30)        3.127 (df = 29)         2.579 (df = 30)    
    F Statistic         76.513*** (df = 1; 30) 43.095*** (df = 2; 29) 139.350*** (df = 1; 30)
    =========================================================================================
    Note:                                                         *p<0.1; **p<0.05; ***p<0.01
    

    Note that group_modify is currently experimental, so please use with caution, as its properties and intent may likely change in the future.

    Also see my other answer for a related problem: Place results of predict() in a for loop inside a list