I want to write a loop on R to perform linear regression on my dataset genes (= 210011 genes and 6 samples total; with columns the gene and the rows the samples) to identify how age and sex affect gene expression. I want to save the fitted value output from the linear regression in a new dataframe (generating basically a similar dataframe where on the columns there are the genes and in the rows the samples).
so the loop I wrote is:
genelist <- df %>% select(5:21011) #select only genes
for (i in 1:length(genelist)) {
formula <- as.formula(paste0(genelist[i], ' ~ age + sex'))
model <- lm(formula, data = df)
print(model$fitted.values)
}
but I am not able to save a new dataframe. I tried to follow this How to Loop/Repeat a Linear Regression in R
test <- list(); model <- list()
for (i in 1:length(genelist)) {
formula[[i]] = paste0(genelist[i], ' ~ age + sex')
model[[i]] = lm(formula[[i]], data = df)
}
but it gives me "list of 0" as results so I must have written something wrong. How can I modif my original code generating a new dataframe with the results?
thanks for help help!
Here's an example that works:
library(dplyr)
set.seed(2053)
df <- data.frame(age = sample(18:80, 6, replace=FALSE),
sex = sample(0:1, 6, replace=TRUE))
for(i in 1:10){
df[[paste0("gene_", i)]] <- runif(6,0,1)
}
genelist <- df %>% select(3:12) #select only genes
pred <- df %>% select(age, sex)
sigs <- NULL
for (i in 1:length(genelist)) {
formula <- reformulate(c("age", "sex"), response=names(genelist)[i])
model <- lm(formula, data = df)
pred[[names(genelist)[i]]] <- predict(model, newdata=pred)
pvals <- summary(model)$coefficients[-1,4]
pvals <- c(pvals, "F" = unname(pf(summary(model)$fstatistic[1],
summary(model)$fstatistic[2],
summary(model)$fstatistic[3],
lower.tail=FALSE)))
sigs <- rbind(sigs, pvals)
}
rownames(sigs) <- colnames(genelist)
pred
#> age sex gene_1 gene_2 gene_3 gene_4 gene_5 gene_6
#> 1 54 0 0.6460394 0.7975062 0.542963150 0.5766314 0.43716321 0.3731399
#> 2 65 0 0.4969311 0.7557411 0.499976012 0.7201710 -0.02954846 0.3392473
#> 3 49 0 0.7138160 0.8164903 0.562502758 0.5113862 0.64930488 0.3885457
#> 4 62 0 0.5375970 0.7671316 0.511699777 0.6810238 0.09773654 0.3484907
#> 5 44 0 0.7815925 0.8354744 0.582042366 0.4461409 0.86144655 0.4039515
#> 6 40 1 0.3976764 0.3673542 0.009429805 0.2500409 0.38185899 0.5017752
#> gene_7 gene_8 gene_9 gene_10
#> 1 0.6990817 0.6336038 0.36330413 0.3146205
#> 2 0.6414371 0.8336259 0.58575121 0.2651734
#> 3 0.7252838 0.5426847 0.26219181 0.3370964
#> 4 0.6571584 0.7790744 0.52508383 0.2786590
#> 5 0.7514859 0.4517656 0.16107950 0.3595723
#> 6 0.1903702 0.9501972 0.09472406 0.6118369
sigs
#> age sex F
#> gene_1 0.5736844190 0.462726187 0.72223654
#> gene_2 0.6526877593 0.079265450 0.12862783
#> gene_3 0.8493679497 0.289034889 0.44026028
#> gene_4 0.6573230145 0.837650010 0.73697690
#> gene_5 0.0004498121 0.001752855 0.00105546
#> gene_6 0.8812687531 0.864282218 0.92669812
#> gene_7 0.7960359480 0.286213514 0.45291886
#> gene_8 0.2845571231 0.190629747 0.35754969
#> gene_9 0.1456621812 0.957422081 0.19649049
#> gene_10 0.7587574987 0.521719609 0.54628975
Created on 2022-10-18 by the reprex package (v2.0.1)
In the example above, pred
is the answer to the original question. In the comments, you asked about identifying whether the models were significant. The object sigs
captures the p-values from both the age
and sex
variables as well as the p-value for the model's omnibus F-statistic.