Search code examples
rdataframefor-looplinear-regressionrna-seq

for-loop linear regression generation new dataframe with the results


I want to write a loop on R to perform linear regression on my dataset genes (= 210011 genes and 6 samples total; with columns the gene and the rows the samples) to identify how age and sex affect gene expression. I want to save the fitted value output from the linear regression in a new dataframe (generating basically a similar dataframe where on the columns there are the genes and in the rows the samples).

so the loop I wrote is:

genelist <- df %>% select(5:21011) #select only genes
for (i in 1:length(genelist)) {
  formula <- as.formula(paste0(genelist[i], ' ~ age + sex'))
  model <- lm(formula, data = df)
  print(model$fitted.values)
}

but I am not able to save a new dataframe. I tried to follow this How to Loop/Repeat a Linear Regression in R

test <- list(); model <- list()
for (i in 1:length(genelist)) {
  formula[[i]] = paste0(genelist[i], ' ~ age + sex')
  model[[i]] = lm(formula[[i]], data = df) 
}

but it gives me "list of 0" as results so I must have written something wrong. How can I modif my original code generating a new dataframe with the results?

thanks for help help!


Solution

  • Here's an example that works:

    library(dplyr)
    set.seed(2053)
    df <- data.frame(age = sample(18:80, 6, replace=FALSE), 
                     sex = sample(0:1, 6, replace=TRUE))
    for(i in 1:10){
      df[[paste0("gene_", i)]] <- runif(6,0,1)
    }
    
    genelist <- df %>% select(3:12) #select only genes
    pred <- df %>% select(age, sex)
    sigs <- NULL
    for (i in 1:length(genelist)) {
      formula <- reformulate(c("age", "sex"), response=names(genelist)[i])  
      model <- lm(formula, data = df)
      pred[[names(genelist)[i]]] <- predict(model, newdata=pred)
      pvals <- summary(model)$coefficients[-1,4]
      pvals <- c(pvals, "F" = unname(pf(summary(model)$fstatistic[1], 
                                 summary(model)$fstatistic[2], 
                                 summary(model)$fstatistic[3], 
                                 lower.tail=FALSE)))
      sigs <- rbind(sigs, pvals)
    }
    rownames(sigs) <- colnames(genelist)
    
    pred
    #>   age sex    gene_1    gene_2      gene_3    gene_4      gene_5    gene_6
    #> 1  54   0 0.6460394 0.7975062 0.542963150 0.5766314  0.43716321 0.3731399
    #> 2  65   0 0.4969311 0.7557411 0.499976012 0.7201710 -0.02954846 0.3392473
    #> 3  49   0 0.7138160 0.8164903 0.562502758 0.5113862  0.64930488 0.3885457
    #> 4  62   0 0.5375970 0.7671316 0.511699777 0.6810238  0.09773654 0.3484907
    #> 5  44   0 0.7815925 0.8354744 0.582042366 0.4461409  0.86144655 0.4039515
    #> 6  40   1 0.3976764 0.3673542 0.009429805 0.2500409  0.38185899 0.5017752
    #>      gene_7    gene_8     gene_9   gene_10
    #> 1 0.6990817 0.6336038 0.36330413 0.3146205
    #> 2 0.6414371 0.8336259 0.58575121 0.2651734
    #> 3 0.7252838 0.5426847 0.26219181 0.3370964
    #> 4 0.6571584 0.7790744 0.52508383 0.2786590
    #> 5 0.7514859 0.4517656 0.16107950 0.3595723
    #> 6 0.1903702 0.9501972 0.09472406 0.6118369
    
    sigs
    #>                  age         sex          F
    #> gene_1  0.5736844190 0.462726187 0.72223654
    #> gene_2  0.6526877593 0.079265450 0.12862783
    #> gene_3  0.8493679497 0.289034889 0.44026028
    #> gene_4  0.6573230145 0.837650010 0.73697690
    #> gene_5  0.0004498121 0.001752855 0.00105546
    #> gene_6  0.8812687531 0.864282218 0.92669812
    #> gene_7  0.7960359480 0.286213514 0.45291886
    #> gene_8  0.2845571231 0.190629747 0.35754969
    #> gene_9  0.1456621812 0.957422081 0.19649049
    #> gene_10 0.7587574987 0.521719609 0.54628975
    

    Created on 2022-10-18 by the reprex package (v2.0.1)

    In the example above, pred is the answer to the original question. In the comments, you asked about identifying whether the models were significant. The object sigs captures the p-values from both the age and sex variables as well as the p-value for the model's omnibus F-statistic.