Search code examples
rdataframelme4genetics

How can I output residuals from multiple models in a single new data frame using R?


I would like to run multiple regression models for a static set of independent variables against multiple different dependent variables and output the residuals into a new file that looks like...

SampleID     site_residual1    site_residual2    site_residual3
F001         0.003             0.988             0.776
F001         0.002             0.876             0.665
F002         0.134             0.234             0.786
...

I have been using the following code to get a single residual output, but have been unsuccessful in implementing a loop that will run through all of my sites.

infile = sprintf("/path/siteinput.txt.gz")

infile looks like...

SampleID     site1  site2   site3   etc...
F001         0.003  0.988   0.776   etc...
F001         0.002  0.876   0.665   etc...
F002         0.134  0.234   0.786   etc...
...

...

pheno = read.table("/path/pheno_covar.txt", header=T, sep="\t")

pheno looks like...

SampleID     indep1 indep2  indep3  chip1   etc...
F001         0.003  0.988   0.776   2       etc...
F001         0.002  0.876   0.665   2       etc...
F002         0.134  0.234   0.786   1       etc...
...

...

residfile = sprintf("/path/test_resid_out.txt")

library(lme4)

beta = read.table(infile, header=T, sep="\t")

merged = merge(beta, pheno, by="SampleID")

site<-merged$site1
chip <- as.factor(merged$chip1)

model1 <- lmer (formula= site ~ indep1 +indep2 + indep3 + (1|chip), data=merged)

print(summary(model1))
print(resid(model1))

site1_resid = resid(model1, na.action=na.exclude)

residout<-(data.frame(SampleID, site1_resid))
write.table(residout, file=residfile, sep="\t", col.names=TRUE, row.names=FALSE, quote=FALSE)

And my output looks like...

SampleID    site1_resid
F001        0.0110177454696274
F002        0.0923483180517723
F003        0.103686493563883
F004        -0.106193404096636
F005        -0.124621172636435
....

...

So, I am really looking for a way to run model1 for each site in my "infile" and output all residuals into a new file. Also, I would like to have the column header to include the original name of the "site". I do have some missing information in (all covariates are complete, but some sites are missing for some IDs).

Any advice would be appreciated.


Solution

  • With the help of magrittr pipes (%>%) to make it easier to read (not necessary, though):

    library(magrittr)
    names(beta) %>% 
      setdiff("SampleID") %>% 
      setNames(., .) %>% 
      lapply(function(x) {
        model <- lmer(data = merged, formula = paste(x, "~ indep1 +indep2 + indep3 + (1|chip)"))
        # print(summary(model))
        # print(resid(model))
        resid(model, na.action=na.exclude)
      }) %>% 
      c(list(SampleID = merged$SampleID), .) %>% 
      do.call(what = "data.frame")
    

    (On a side note, it worries me that you have duplicate SampleIDs. Is it intended? If yes, are you sure you want to merge() by SampleID? Would you not rather do cbind(beta, pheno[, - 1, drop = FALSE])?)