I have gene expression data as number of counts for each probe, something like this:
library(data.table)
mydata <- fread(
"molclass,mol.id,sample1,sample2,sample3
negative, negat1, 0, 1, 2
negative, negat2, 2, 1, 1
negative, negat3, 1, 2, 0
endogen, gene1, 30, 15, 10
endogen, gene2, 60, 30, 20
")
My question here is - what would be the best way to perform background subtraction, i.e. for each sampleN
column I need to calculate background (let's say it will be the average of all values from negative
class) and then subtract this background from each value of this column. For the moment I am using the following solution:
for (nm in names(mydata)[-c(1:2)]) {
bg <- mydata[molclass=='negative', nm, with=F];
bg <- mean(unlist(bg));
mydata[[nm]] <- (mydata[[nm]] - bg);
}
but I feel there must be some "nicer" way.
P.S. I know that there are some packages that do those things, but my data correspond to the number of counts, not intensity of signal - so I can't use limma
or similar tools designed for microarrays. Maybe some seq-data packages could help, but I am not sure because my data is not from sequencing either.
Generally, you shouldn't use <-
with a data.table
. The last assignment in your loop would be better with set
. See the help page by typing ?set
for details.
mycols <- paste0('sample',1:3)
newcols <- paste0(mycols,'bk')
s <- mydata[['molclass']] == 'negative'
mybkds <- sapply(mycols,function(j) mean(mydata[[j]][s]) )
mydata[,(newcols):=NA]
for (j in mycols) set(mydata,j=paste0(j,'bk'),value=mydata[[j]]-mybkds[j])
I've only done the last step in a loop, but this is basically the same as your code (where everything is in the loop). *apply
functions and loops are just different syntax, I've heard, and you can go with whichever you prefer.