I have a dataframe dfA (the real one has 1000 rows and 400,000 columns). From column 6 on, the variable names are "triads" formed by x with + different prefixes (GT_x, N_x, E_x), where x = rs1, rs7, rs300, rs502, etc:
ID SEX PV GAN GAE GT_rs1 N_rs1 E_rs1 GT_rs7 N_rs7 E_rs7 ...
2 0 7.8 0.3 0.4 0 1 1 1 0 2 ...
6 1 6.4 0.35 0.55 0 0 1 1 1 2 ...
Here is a reproducible example of my data:
dfA = data.frame(rbind(c("ID","SEX","PV","GAN","GAE","GT_rs1","N_rs1","E_rs1","GT_rs7","N_rs7","E_rs7"),
c(2,0,7.8,0.3,0.4,0,1,1,1,0,2),
c(6,1,6.4,0.35,0.55,0,0,1,1,1,2)))
dfA = dfA %>% row_to_names(row_number = 1)
Using R, I want to run a linear regression of the form:
lm(PV ~ SEX + GAN + GT_x + N_x)
where x is rs1, rs7 and so on. So, I'd need to iterate column-wise over pairs of variables. I would like to get estimate, std.error, statistic and p.value for the different covariates (SEX, GAN, GT_x and N_x). SEX = categorical variable; PV, GAN = quantitative variables; GT_x, N_x, E_x = additive variables.
You can build formulas by pasting together strings - we just need to know the strings you want to paste together.
This should work - it's untested because the data you share isn't shared with dput
so it's not copy/pasteable, and it only has one set of covariates so it doesn't illustrate the complexity of the problem. If you have issues, please share copy/pasteable data to illustrate and I'll try to debug.
library(stringr)
library(dplyr)
library(broom)
# get all unique strings after underscores from your column names
suffix = str_extract(names(dfA), "_.*") %>% na.omit %>% unique
prefix = c("GT", "N")
base_formula = "PV ~ SEX + GAN +"
full_formula = paste(base_formula, paste0(prefix[1], suffix), "+", paste0(prefix[2], suffix))
mods = list()
for(i in seq_along(full_formula)) {
mods[[suffix[i]]] = lm(as.formula(full_formula[i]), data = dfA)
}
stats = lapply(mods, tidy)
stats = bind_rows(stats, .id = "suffix")