Search code examples
rdplyrpurrrgeneticsbroom

How to iterate through independent variables, performing multiple linear regressions using the tidyverse framework?


My data:

library(tidyverse)
library(lme4)

myData <- structure(list(Subjects = c(4L, 3L, 5L, 1L, 9L, 6L, 10L, 2L, 
8L, 7L), Gene1 = c(0.318630087617032, -0.58179068471591, 0.714532710891568, 
-0.825259425862769, -0.359862131395465, 0.0898861437775305, 0.0962744602851301, 
-0.201633952183354, 0.739840499878431, 0.123379501088869), Variant1 = c(1L, 
0L, 1L, 2L, 2L, 1L, 0L, 1L, 2L, 0L), Variant2 = c(0L, 0L, 2L, 
2L, 0L, 2L, 2L, 2L, 2L, 0L), Variant3 = c(1L, 1L, 0L, 2L, 0L, 
1L, 1L, 1L, 2L, 1L), Variant4 = c(1L, 2L, 1L, 0L, 0L, 1L, 0L, 
2L, 1L, 1L), Age = c(81L, 60L, 85L, 87L, 76L, 78L, 88L, 64L, 
90L, 75L), Sex = c(0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L), RIN = c(6L, 
6L, 8L, 6L, 8L, 7L, 8L, 7L, 7L, 6L), ABG = structure(c(4L, 5L, 
8L, 3L, 6L, 2L, 3L, 4L, 7L, 1L), .Label = c("F1", "F10", "F2", 
"F3", "F4", "F5", "F6", "F8"), class = "factor")), row.names = c(NA, 
-10L), class = "data.frame", .Names = c("Subjects", "Gene1", 
"Variant1", "Variant2", "Variant3", "Variant4", "Age", "Sex", 
"RIN", "ABG"))

myData
   Subjects       Gene1 Variant1 Variant2 Variant3 Variant4 Age Sex RIN ABG
1         4  0.31863009        1        0        1        1  81   0   6  F3
2         3 -0.58179068        0        0        1        2  60   1   6  F4
3         5  0.71453271        1        2        0        1  85   0   8  F8
4         1 -0.82525943        2        2        2        0  87   1   6  F2
5         9 -0.35986213        2        0        0        0  76   0   8  F5
6         6  0.08988614        1        2        1        1  78   1   7 F10
7        10  0.09627446        0        2        1        0  88   0   8  F2
8         2 -0.20163395        1        2        1        2  64   0   7  F3
9         8  0.73984050        2        2        2        1  90   1   7  F6
10        7  0.12337950        0        0        1        1  75   1   6  F1

Gene1 is my dependent variable and Variant1, Variant2, Variant3 and Variant4 are my independent variables. Age, Sex, RIN and ABG are my covariates. Using the tidyverse framework (broom/dplyr/purrr/map) I'd like to iterate through Variant1:Variant4 performing the following linear regressions using a linear mixed model:

lmer(Gene1~Variant1+Age+Sex+RIN+(1|ABG), myData) for Variant1,
lmer(Gene1~Variant2+Age+Sex+RIN+(1|ABG), myData) for Variant2, so on ...

At the end, I'd like generate a results table with beta coefficients (Estimate), Std.Err and pValues for all Variant* (possibly using tidy/augment/glance??).

PS. The number of Variant* my vary.

Thank you!


Solution

  • For your problem you can gather() in order to utilize group_split() on all the different kinds of variants. From that point we can iterate over each split data.frame and run the linear model. Inside of the map() we'll broom::tidy() the data and add a column to distinguish the estimates for each model. I used map_df() to end up with a single dataframe but you can also just use map() to end up with a list of data.frames.

    library(tidyverse)
    library(lme4)
    library(broom)
    
    dat <- tribble(~Subjects, ~Gene1, ~Variant1, ~Variant2, ~Variant3, ~Variant4, ~Age, ~Sex, ~RIN, ~ABG,
            1,     -0.82525943,  2,      2,       2,       0,     87,   1,   6,     "F2",
            2,     -0.20163395,  1,      2,       1,       2, 64  , 0 ,  7,       "F3",
            3,     -0.58179068,  0,      0,       1,       2, 60  , 1 ,  6,       "F4",
            4,      0.31863009,  1,      0,       1,       1, 81  , 0 ,  6,       "F3",
            5,      0.71453271,  1,      2,       0,       1, 85  , 0 ,  8,       "F8",
            6,      0.08988614,  1,      2,       1,       1, 78  , 1 ,  7,      "F10",
            7,      0.12337950,  0,      0,       1,       1, 75  , 1 ,  6,       "F1",
            8,      0.73984050,  2,      2,       2,       1, 90  , 1 ,  7,       "F6",
            9,     -0.35986213,  2,      0,       0,       0, 76  , 0 ,  8,       "F5",
            10,     0.09627446,  0,      2,       1,       0,  88,   0,   8,       "F2")
    
    dat %>% 
      gather(key = "variants", value = "var_value", Variant1:Variant4) %>% 
      group_split(variants) %>% 
      map_df(~lmer(Gene1~var_value+Age+Sex+RIN+(1|ABG), data = .x) %>% 
            tidy() %>% 
            mutate(variant_group = unique(.x$variants)))
    #> # A tibble: 28 x 6
    #>    term                  estimate std.error statistic group   variant_group
    #>    <chr>                    <dbl>     <dbl>     <dbl> <chr>   <chr>        
    #>  1 (Intercept)           -3.91      1.96     -2.00    fixed   Variant1     
    #>  2 var_value             -0.280     0.120    -2.34    fixed   Variant1     
    #>  3 Age                    0.0401    0.00905   4.43    fixed   Variant1     
    #>  4 Sex                    0.00107   0.391     0.00274 fixed   Variant1     
    #>  5 RIN                    0.161     0.154     1.05    fixed   Variant1     
    #>  6 sd_(Intercept).ABG     0.462    NA        NA       ABG     Variant1     
    #>  7 sd_Observation.Resid…  0.00234  NA        NA       Residu… Variant1     
    #>  8 (Intercept)           -3.60      2.74     -1.31    fixed   Variant2     
    #>  9 var_value             -0.0625    0.202    -0.309   fixed   Variant2     
    #> 10 Age                    0.0295    0.0175    1.69    fixed   Variant2     
    #> # … with 18 more rows
    

    Created on 2019-02-23 by the reprex package (v0.2.1)