Search code examples
rtidyverselinear-regressionbroom

How to perform multiple linear model fits of one column against all (pairwise) using dplyr and broom


Given the mtcars data:

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

What I want to do is to perform linear fit with y = mpg as the outcome with all other variables as predictors x , in one on one basis (i.e. from mpg ~ cyl up to mpg ~ carb).

How can I do that with dplyr?

So far I have this code (as suggested by Maurits):

 library(broom)
  library(tidyr)
  library(dplyr)
 lm(mpg ~ ., data = mtcars) %>% 
 # glance() # this fail
 tidy()

It gave me this result:

# A tibble: 11 × 5
   term        estimate std.error statistic p.value
   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)  12.3      18.7        0.657  0.518 
 2 cyl          -0.111     1.05      -0.107  0.916 
 3 disp          0.0133    0.0179     0.747  0.463 
 4 hp           -0.0215    0.0218    -0.987  0.335 
 5 drat          0.787     1.64       0.481  0.635 
 6 wt           -3.72      1.89      -1.96   0.0633
 7 qsec          0.821     0.731      1.12   0.274 
 8 vs            0.318     2.10       0.151  0.881 
 9 am            2.52      2.06       1.23   0.234 
10 gear          0.655     1.49       0.439  0.665 
11 carb         -0.199     0.829     -0.241  0.812 

But when I do

> tidy(lm(mpg ~ cyl, data = mtcars))
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    37.9      2.07      18.3  8.37e-18
2 cyl            -2.88     0.322     -8.92 6.11e-10

Notice the difference in estimate from -2.88 to -0.11.


Solution

  • You could reshape, groupby and then do the lm:

    library(tidyverse)
    mtcars %>%
      pivot_longer(-mpg) %>%
      group_by(name) %>%
      summarise(broom::tidy(lm(mpg~value, cur_data())), .groups='drop')
    
     name  term        estimate std.error statistic  p.value
       <chr> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
     1 am    (Intercept)  17.1      1.12       15.2   1.13e-15
     2 am    value         7.24     1.76        4.11  2.85e- 4
     3 carb  (Intercept)  25.9      1.84       14.1   9.22e-15
     4 carb  value        -2.06     0.569      -3.62  1.08e- 3
     5 cyl   (Intercept)  37.9      2.07       18.3   8.37e-18
     6 cyl   value        -2.88     0.322      -8.92  6.11e-10
     7 disp  (Intercept)  29.6      1.23       24.1   3.58e-21
     8 disp  value        -0.0412   0.00471    -8.75  9.38e-10
     9 drat  (Intercept)  -7.52     5.48       -1.37  1.80e- 1
    10 drat  value         7.68     1.51        5.10  1.78e- 5
    11 gear  (Intercept)   5.62     4.92        1.14  2.62e- 1
    12 gear  value         3.92     1.31        3.00  5.40e- 3
    13 hp    (Intercept)  30.1      1.63       18.4   6.64e-18
    14 hp    value        -0.0682   0.0101     -6.74  1.79e- 7
    15 qsec  (Intercept)  -5.11    10.0        -0.510 6.14e- 1
    16 qsec  value         1.41     0.559       2.53  1.71e- 2
    17 vs    (Intercept)  16.6      1.08       15.4   8.85e-16
    18 vs    value         7.94     1.63        4.86  3.42e- 5
    19 wt    (Intercept)  37.3      1.88       19.9   8.24e-19
    20 wt    value        -5.34     0.559      -9.56  1.29e-10
    

    If only interested in the coefficient values, you can directly use base R:

    lm(mpg~ind/values+0, cbind(mpg=mtcars$mpg,stack(mtcars, -mpg)))
    
    Call:
    lm(formula = mpg ~ ind/values + 0, data = cbind(mpg = mtcars$mpg, 
        stack(mtcars, -mpg)))
    
    Coefficients:
            indcyl         inddisp           indhp         inddrat           indwt  
          37.88458        29.59985        30.09886        -7.52462        37.28513  
           indqsec           indvs           indam         indgear         indcarb  
          -5.11404        16.61667        17.14737         5.62333        25.87233  
     indcyl:values  inddisp:values    indhp:values  inddrat:values    indwt:values  
          -2.87579        -0.04122        -0.06823         7.67823        -5.34447  
    indqsec:values    indvs:values    indam:values  indgear:values  indcarb:values  
           1.41212         7.94048         7.24494         3.92333        -2.05572 
    

    Note that the indVAR is the intercept of the variable and indVAR:value is the coefficient. eg intercept for cyl = 37.88458 while the coefficient for cyl = -2.87579. This is the same results given by the tidyverse function