Search code examples
rapplylapplylmsapply

Performing a linear model in R of a single response with a single predictor from a large dataframe and repeat for each column


It might not be very clear from the title but what I wish to do is:

  1. I have a dataframe df with, say, 200 columns and the first 80 columns are response variables (y1, y2, y3, ...) and the rest of 120 are predictors (x1, x2, x3, ...).

  2. I wish to compute a linear model for each pair – lm(yi ~ xi, data = df).

  3. Many problems and solutions I have looked through online have a either a fixed response vs many predictors or the other way around, using lapply() and its related functions.

Could anyone who is familiar with it point me to the right step?


Solution

  • use tidyverse

    library(tidyverse)
    library(broom)
    df <- mtcars
    y <- names(df)[1:3]
    x <- names(df)[4:7]
    
    result <- expand_grid(x, y) %>% 
      rowwise() %>% 
      mutate(frm = list(reformulate(x, y)),
             model = list(lm(frm, data = df))) 
    
    result$model <- purrr::set_names(result$model, nm = paste0(result$y, " ~ ", result$x))
    
    result$model[1:2]
    #> $`mpg ~ hp`
    #> 
    #> Call:
    #> lm(formula = frm, data = df)
    #> 
    #> Coefficients:
    #> (Intercept)           hp  
    #>    30.09886     -0.06823  
    #> 
    #> 
    #> $`cyl ~ hp`
    #> 
    #> Call:
    #> lm(formula = frm, data = df)
    #> 
    #> Coefficients:
    #> (Intercept)           hp  
    #>     3.00680      0.02168
    
    map_df(result$model, tidy)
    #> # A tibble: 24 x 5
    #>    term        estimate std.error statistic  p.value
    #>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
    #>  1 (Intercept)  30.1      1.63       18.4   6.64e-18
    #>  2 hp           -0.0682   0.0101     -6.74  1.79e- 7
    #>  3 (Intercept)   3.01     0.425       7.07  7.41e- 8
    #>  4 hp            0.0217   0.00264     8.23  3.48e- 9
    #>  5 (Intercept)  21.0     32.6         0.644 5.25e- 1
    #>  6 hp            1.43     0.202       7.08  7.14e- 8
    #>  7 (Intercept)  -7.52     5.48       -1.37  1.80e- 1
    #>  8 drat          7.68     1.51        5.10  1.78e- 5
    #>  9 (Intercept)  14.6      1.58        9.22  2.93e-10
    #> 10 drat         -2.34     0.436      -5.37  8.24e- 6
    #> # ... with 14 more rows
    
    map_df(result$model, glance)
    #> # A tibble: 12 x 12
    #>    r.squared adj.r.squared  sigma statistic  p.value    df logLik   AIC   BIC
    #>        <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
    #>  1     0.602         0.589   3.86     45.5  1.79e- 7     1  -87.6 181.  186. 
    #>  2     0.693         0.683   1.01     67.7  3.48e- 9     1  -44.6  95.1  99.5
    #>  3     0.626         0.613  77.1      50.1  7.14e- 8     1 -183.  373.  377. 
    #>  4     0.464         0.446   4.49     26.0  1.78e- 5     1  -92.4 191.  195. 
    #>  5     0.490         0.473   1.30     28.8  8.24e- 6     1  -52.7 111.  116. 
    #>  6     0.504         0.488  88.7      30.5  5.28e- 6     1 -188.  382.  386. 
    #>  7     0.753         0.745   3.05     91.4  1.29e-10     1  -80.0 166.  170. 
    #>  8     0.612         0.599   1.13     47.4  1.22e- 7     1  -48.3 103.  107. 
    #>  9     0.789         0.781  57.9     112.   1.22e-11     1 -174.  355.  359. 
    #> 10     0.175         0.148   5.56      6.38 1.71e- 2     1  -99.3 205.  209. 
    #> 11     0.350         0.328   1.46     16.1  3.66e- 4     1  -56.6 119.  124. 
    #> 12     0.188         0.161 114.        6.95 1.31e- 2     1 -196.  398.  402. 
    #> # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
    

    Created on 2020-12-11 by the reprex package (v0.3.0)