I'm new to r I need to run a loop or use sapply to fit a logistic regression model to each column (beginning with 'rs') separately with the below formula:
AD ~ [cov1:cov6] + rs1 then repeat for each rs2, rs3 etc.
I would also need to repeat the above code but change the formula to:
AD ~ [cov1:cov6] + rs1 * alcohol_intake then also repeat for each rs2, rs3 etc.
example data:
set.seed(543)
mydata <- data.frame(ID = 1:100,
sex = sample(1:2, size = 500, replace = TRUE),
age = runif(100, min= 35, max = 70),
bmi = runif(100, min= 15, max = 35),
smoker = rep(c('smoker', 'not smoker')),
alcohol_intake = rep(c('regular', 'not regular')),
AD = sample(0:1, size = 500, replace = TRUE),
cov1 = runif(100, min= 0.0000, max = 1.0000),
cov2 = runif(100, min= -3.013e-02, max = 1.818e-02),
cov3 = runif(100, min= -3.562e-02, max = 1.540e-02),
cov4 = runif(100, min= -2.356e-02, max = 1.685e-02),
cov5 = runif(100, min= -1.392e-02, max = 2.894e-02),
cov6 = runif(100, min= -1.896e-02, max = 2.136e-02),
rs1 = sample(0:2, size = 500, replace = TRUE),
rs2 = sample(0:2, size = 500, replace = TRUE),
rs3 = sample(0:2, size = 500, replace = TRUE),
rs4 = sample(0:2, size = 500, replace = TRUE),
rs5 = sample(0:2, size = 500, replace = TRUE),
rs6 = sample(0:2, size = 500, replace = TRUE),
rs7 = sample(0:2, size = 500, replace = TRUE),
rs8 = sample(0:2, size = 500, replace = TRUE),
rs9 = sample(0:2, size = 500, replace = TRUE),
rs10 = sample(0:2, size = 500, replace = TRUE),
rs11 = sample(0:2, size = 500, replace = TRUE),
rs12 = sample(0:2, size = 500, replace = TRUE),
rs13 = sample(0:2, size = 500, replace = TRUE),
rs14 = sample(0:2, size = 500, replace = TRUE),
rs15 = sample(0:2, size = 500, replace = TRUE),
rs16 = sample(0:2, size = 500, replace = TRUE),
rs17 = sample(0:2, size = 500, replace = TRUE),
rs18 = sample(0:2, size = 500, replace = TRUE),
rs19 = sample(0:2, size = 500, replace = TRUE),
rs20 = sample(0:2, size = 500, replace = TRUE),
rs21 = sample(0:2, size = 500, replace = TRUE),
rs22 = sample(0:2, size = 500, replace = TRUE),
rs23 = sample(0:2, size = 500, replace = TRUE),
rs24 = sample(0:2, size = 500, replace = TRUE),
rs25 = sample(0:2, size = 500, replace = TRUE),
rs26 = sample(0:2, size = 500, replace = TRUE),
rs27 = sample(0:2, size = 500, replace = TRUE),
rs28 = sample(0:2, size = 500, replace = TRUE),
rs29 = sample(0:2, size = 500, replace = TRUE),
rs30 = sample(0:2, size = 500, replace = TRUE)
)
Thanks.
Building on my comment in my previous answer that I feel conversion to long format would be beneficial in the long run...
Converting to long format is simple:
mydataLong <- mydata %>%
pivot_longer(
cols=starts_with("rs"),
names_to="Variable",
values_to="Value"
)
Which gives a data frame that looks like this (dropping some columns to display the section of the data frame that is of interest):
mydataLong %>% select(AD, starts_with("cov"), Variable, Value)
# A tibble: 15,000 × 9
AD cov1 cov2 cov3 cov4 cov5 cov6 Variable Value
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int>
1 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs1 0
2 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs2 0
3 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs3 1
4 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs4 0
5 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs5 2
6 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs6 1
7 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs7 0
8 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs8 0
9 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs9 1
10 0 0.490 -0.0201 -0.0277 0.0142 -0.00377 0.00758 rs10 2
To exactly reproduce the analysis requested by OP, I can use the tidyverse's group_map()
function. group_map
applies a function to a grouped data frame and returns a list whose elements are the results of applying the function provided to each group of the data frame in turn. The function should have two arguments, conventionally named .x
and .y
. .x
contains the data in the current group. .y
contains a single row and only those columns that define the current group.
resultList <- mydataLong %>%
group_by(Variable) %>%
group_map(
function(.x, .y) {
glm(
AD ~ cov1 + cov2 + cov3 + cov4 +cov5 + cov6 + Value,
data=.x,
family="binomial"
)
}
)
resultList[[17]]
Call: glm(formula = AD ~ cov1 + cov2 + cov3 + cov4 + cov5 + cov6 +
Value, family = "binomial", data = .x)
Coefficients:
(Intercept) cov1 cov2 cov3 cov4 cov5 cov6 Value
0.1370 0.0422 3.9611 1.0340 -10.2308 3.5968 3.1808 -0.1629
Degrees of Freedom: 499 Total (i.e. Null); 492 Residual
Null Deviance: 693.1
Residual Deviance: 688.5 AIC: 704.5
The advantage of this approach is that it is robust. Once the data frame is in long format, the same code will work correctly for any number of response variables, and regardless of what for their names take. That's not true of the "wide format" solution. For example, OP pointed out that in their real use case, response variables had names that look like rs65387690_A
rather than rs1
. That required a change to the code. That's not necessary here: a source of error has been removed and the maintenance of production code has been reduced.
I think it's also worth mentioning the broom
package here. broom
attempts to remove inconsistency in the format of objects returned by various analysis functions: it converts the object returned to a data frame and attempts into standardise column names. Essentially, it does this by defining a generic function tidy()
. If broom
doesn't provide a tidy()
method for your particular flavour of results object, it's pretty easy to write your own.
So, adapting my solution to use broom
and adding a call to bind_rows()
to the end of the pipe, I get
library(broom)
mydataLong %>%
group_by(Variable) %>%
group_map(
function(.x, .y) {
glm(
AD ~ cov1 + cov2 + cov3 + cov4 +cov5 + cov6 + Value,
data=.x,
family="binomial"
) %>%
tidy() %>%
add_column(Variable=.y$Variable, .before=1)
}
) %>%
bind_rows()
# A tibble: 240 × 6
Variable term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 rs1 (Intercept) -0.308 0.245 -1.26 0.209
2 rs1 cov1 0.0277 0.354 0.0784 0.938
3 rs1 cov2 3.67 6.79 0.540 0.589
4 rs1 cov3 0.362 6.01 0.0602 0.952
5 rs1 cov4 -8.92 8.09 -1.10 0.270
6 rs1 cov5 4.48 7.11 0.630 0.529
7 rs1 cov6 2.27 8.06 0.282 0.778
8 rs1 Value 0.263 0.114 2.30 0.0214
9 rs10 (Intercept) -0.0858 0.248 -0.346 0.729
10 rs10 cov1 0.0583 0.352 0.166 0.869
# … with 230 more rows
So the results of all the analyses are in a single data frame, which I find much easier to handle downstream than a list of individual objects.
broom::tidy()
can be used in the "wide format" solution as well as this "long-format" variation.