Search code examples
rregressionlogistic-regressionglm

Fit a logistic regression model separately for each column in r


I'm new to r I need to run a loop or use sapply to fit a logistic regression model to each column (beginning with 'rs') separately with the below formula:

AD ~ [cov1:cov6] + rs1 then repeat for each rs2, rs3 etc.

I would also need to repeat the above code but change the formula to:

AD ~ [cov1:cov6] + rs1 * alcohol_intake then also repeat for each rs2, rs3 etc.

example data:

set.seed(543)
mydata <- data.frame(ID = 1:100, 
                     sex = sample(1:2, size = 500, replace = TRUE), 
                     age = runif(100, min= 35, max = 70), 
                     bmi = runif(100, min= 15, max = 35),
                     smoker = rep(c('smoker', 'not smoker')),
                     alcohol_intake = rep(c('regular', 'not regular')),
                     AD = sample(0:1, size = 500, replace = TRUE),
                     cov1 = runif(100, min= 0.0000, max = 1.0000),
                     cov2 = runif(100, min= -3.013e-02, max = 1.818e-02),
                     cov3 = runif(100, min= -3.562e-02, max = 1.540e-02),
                     cov4 = runif(100, min= -2.356e-02, max = 1.685e-02),
                     cov5 = runif(100, min= -1.392e-02, max = 2.894e-02),
                     cov6 = runif(100, min= -1.896e-02, max = 2.136e-02),
                     rs1 = sample(0:2, size = 500, replace = TRUE), 
                     rs2 = sample(0:2, size = 500, replace = TRUE),
                     rs3 = sample(0:2, size = 500, replace = TRUE),
                     rs4 = sample(0:2, size = 500, replace = TRUE),
                     rs5 = sample(0:2, size = 500, replace = TRUE),
                     rs6 = sample(0:2, size = 500, replace = TRUE),
                     rs7 = sample(0:2, size = 500, replace = TRUE),
                     rs8 = sample(0:2, size = 500, replace = TRUE),
                     rs9 = sample(0:2, size = 500, replace = TRUE),
                     rs10 = sample(0:2, size = 500, replace = TRUE),
                     rs11 = sample(0:2, size = 500, replace = TRUE),
                     rs12 = sample(0:2, size = 500, replace = TRUE),
                     rs13 = sample(0:2, size = 500, replace = TRUE),
                     rs14 = sample(0:2, size = 500, replace = TRUE),
                     rs15 = sample(0:2, size = 500, replace = TRUE),
                     rs16 = sample(0:2, size = 500, replace = TRUE),
                     rs17 = sample(0:2, size = 500, replace = TRUE),
                     rs18 = sample(0:2, size = 500, replace = TRUE),
                     rs19 = sample(0:2, size = 500, replace = TRUE),
                     rs20 = sample(0:2, size = 500, replace = TRUE),
                     rs21 = sample(0:2, size = 500, replace = TRUE),
                     rs22 = sample(0:2, size = 500, replace = TRUE),
                     rs23 = sample(0:2, size = 500, replace = TRUE),
                     rs24 = sample(0:2, size = 500, replace = TRUE),
                     rs25 = sample(0:2, size = 500, replace = TRUE),
                     rs26 = sample(0:2, size = 500, replace = TRUE),
                     rs27 = sample(0:2, size = 500, replace = TRUE),
                     rs28 = sample(0:2, size = 500, replace = TRUE),
                     rs29 = sample(0:2, size = 500, replace = TRUE),
                     rs30 = sample(0:2, size = 500, replace = TRUE)
)

Thanks.


Solution

  • Building on my comment in my previous answer that I feel conversion to long format would be beneficial in the long run...

    Converting to long format is simple:

    mydataLong <- mydata %>% 
                    pivot_longer(
                      cols=starts_with("rs"), 
                      names_to="Variable", 
                      values_to="Value"
                    )
    

    Which gives a data frame that looks like this (dropping some columns to display the section of the data frame that is of interest):

    mydataLong %>% select(AD, starts_with("cov"), Variable, Value)
    # A tibble: 15,000 × 9
          AD  cov1    cov2    cov3   cov4     cov5    cov6 Variable Value
       <int> <dbl>   <dbl>   <dbl>  <dbl>    <dbl>   <dbl> <chr>    <int>
     1     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs1          0
     2     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs2          0
     3     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs3          1
     4     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs4          0
     5     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs5          2
     6     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs6          1
     7     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs7          0
     8     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs8          0
     9     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs9          1
    10     0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs10         2
    

    To exactly reproduce the analysis requested by OP, I can use the tidyverse's group_map() function. group_map applies a function to a grouped data frame and returns a list whose elements are the results of applying the function provided to each group of the data frame in turn. The function should have two arguments, conventionally named .x and .y. .x contains the data in the current group. .y contains a single row and only those columns that define the current group.

    resultList <- mydataLong %>%
                    group_by(Variable) %>% 
                    group_map(
                    function(.x, .y) {
                      glm(
                        AD ~ cov1 + cov2 + cov3 + cov4 +cov5 + cov6 + Value, 
                        data=.x, 
                        family="binomial"
                      )
                    }
                  )
    resultList[[17]]
    Call:  glm(formula = AD ~ cov1 + cov2 + cov3 + cov4 + cov5 + cov6 + 
        Value, family = "binomial", data = .x)
    
    Coefficients:
    (Intercept)         cov1         cov2         cov3         cov4         cov5         cov6        Value  
         0.1370       0.0422       3.9611       1.0340     -10.2308       3.5968       3.1808      -0.1629  
    
    Degrees of Freedom: 499 Total (i.e. Null);  492 Residual
    Null Deviance:      693.1 
    Residual Deviance: 688.5    AIC: 704.5
    

    The advantage of this approach is that it is robust. Once the data frame is in long format, the same code will work correctly for any number of response variables, and regardless of what for their names take. That's not true of the "wide format" solution. For example, OP pointed out that in their real use case, response variables had names that look like rs65387690_A rather than rs1. That required a change to the code. That's not necessary here: a source of error has been removed and the maintenance of production code has been reduced.

    I think it's also worth mentioning the broom package here. broom attempts to remove inconsistency in the format of objects returned by various analysis functions: it converts the object returned to a data frame and attempts into standardise column names. Essentially, it does this by defining a generic function tidy(). If broom doesn't provide a tidy() method for your particular flavour of results object, it's pretty easy to write your own.

    So, adapting my solution to use broom and adding a call to bind_rows() to the end of the pipe, I get

    library(broom)
    
    mydataLong %>% 
     group_by(Variable) %>% 
     group_map(
       function(.x, .y) {
         glm(
           AD ~ cov1 + cov2 + cov3 + cov4 +cov5 + cov6 + Value, 
           data=.x, 
           family="binomial"
         ) %>% 
         tidy() %>% 
           add_column(Variable=.y$Variable, .before=1)
       }
    ) %>% 
    bind_rows()
    # A tibble: 240 × 6
       Variable term        estimate std.error statistic p.value
       <chr>    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
     1 rs1      (Intercept)  -0.308      0.245   -1.26    0.209 
     2 rs1      cov1          0.0277     0.354    0.0784  0.938 
     3 rs1      cov2          3.67       6.79     0.540   0.589 
     4 rs1      cov3          0.362      6.01     0.0602  0.952 
     5 rs1      cov4         -8.92       8.09    -1.10    0.270 
     6 rs1      cov5          4.48       7.11     0.630   0.529 
     7 rs1      cov6          2.27       8.06     0.282   0.778 
     8 rs1      Value         0.263      0.114    2.30    0.0214
     9 rs10     (Intercept)  -0.0858     0.248   -0.346   0.729 
    10 rs10     cov1          0.0583     0.352    0.166   0.869 
    # … with 230 more rows
    

    So the results of all the analyses are in a single data frame, which I find much easier to handle downstream than a list of individual objects.

    broom::tidy() can be used in the "wide format" solution as well as this "long-format" variation.