Search code examples
rautomationregressiondata-analysis

How do I conduct hundreds of regressions, extract statistics and output these into a table?


I need to conduct hundreds of multiple regression models on several variables in my dataset. There are too many to run manually and I need to extract their results into one big table.

My real dataset contains data on blood protein concentrations but I have rewritten my code very simply for the mtcars dataset. In my dataset, p refers to one protein at a time of ~300 total proteins but here refers to columns/variables 2:5 of the mtcars dataset. The following code produces a long-form table containing the stats for each main effect and interaction.

library(tidyverse)

models <- lapply(mtcars[2:5], function(p) 
  (summary(lm(mpg ~ p * wt + qsec, data = mtcars))))
models_stats <- lapply(models, tidy) 
models_table <- do.call(rbind, models_stats)
models_table

OUTPUT: 
# A tibble: 20 × 5
   term        estimate std.error statistic     p.value
 * <chr>          <dbl>     <dbl>     <dbl>       <dbl>
 1 (Intercept)  41.6      8.20        5.08  0.0000247  
 2 p            -3.39     0.963      -3.52  0.00154    
 3 wt          -10.8      2.40       -4.51  0.000113   
 4 qsec          0.757    0.349       2.17  0.0390     
 5 p:wt          0.977    0.317       3.08  0.00473    
 6 (Intercept)  30.4      5.82        5.22  0.0000167  
 7 p            -0.0378   0.0138     -2.73  0.0109     
 8 wt           -7.61     1.26       -6.04  0.00000188 
 9 qsec          0.780    0.291       2.68  0.0123     
10 p:wt          0.0106   0.00298     3.55  0.00143    
11 (Intercept)  40.3      7.68        5.25  0.0000156  
12 p            -0.106    0.0263     -4.04  0.000395   
13 wt           -8.68     1.29       -6.72  0.000000328
14 qsec          0.503    0.361       1.39  0.174      
15 p:wt          0.0278   0.00730     3.81  0.000733   
16 (Intercept)  -7.79    11.3        -0.692 0.495      
17 p             7.52     2.81        2.68  0.0124     
18 wt            2.80     3.21        0.873 0.390      
19 qsec          0.874    0.246       3.55  0.00142    
20 p:wt         -2.12     0.926      -2.29  0.0301   

I am only interested in the interaction and so this is the only coefficient needing to be extracted. How can I: a) remove the p, wt, and qsec rows and keep only the q:wt row? b) make each row identifiable by its variable (i.e. protein name)? c) convert this into an easy to read table


Solution

  • You could try using grep for the interaction and then rename according to what you specify in your lapply:

    short_table <- models_table[grep("\\:", models_table$term),]
    short_table["p_name"] <- names(mtcars[2:5])
    

    Output:

    term  estimate std.error statistic  p.value p_name
      <chr>    <dbl>     <dbl>     <dbl>    <dbl> <chr> 
    1 p:wt    0.977    0.317        3.08 0.00473  cyl   
    2 p:wt    0.0106   0.00298      3.55 0.00143  disp  
    3 p:wt    0.0278   0.00730      3.81 0.000733 hp    
    4 p:wt   -2.12     0.926       -2.29 0.0301   drat 
    

    Alternatively you could modify the lapply statement and ignore the broom::tidy(). For instance:

    models2 <- lapply(mtcars[2:5], function(p) 
      (summary(lm(mpg ~ p * wt + qsec, data = mtcars)))$coefficients["p:wt",])
    
    models2_table <- do.call(rbind, models2)
    

    Which would provide all the statistics for the interaction term:

          Estimate Std. Error  t value    Pr(>|t|)
    cyl  0.7571610  0.3490572 2.169160 0.039044270
    disp 0.7796404  0.2905776 2.683071 0.012301653
    hp   0.5031633  0.3607684 1.394699 0.174476446
    drat 0.8739726  0.2458620 3.554728 0.001418666