NOTE: I'm not asking a Bioconductor-specific question, but I need Bioconductor in the example code. Bear with me please.
Hi,
I have a number of tab-delimited files containing various types of information about specific genes. One or more of the columns may contain aliases of gene symbols, which I need to update to the latest gene symbol annotation.
I'm using Bioconductor's org.Hs.eg.db package to do so (the org.Hs.egALIAS2EG and org.Hs.egSYMBOL objects in particular).
The code below does the job but is very slow, I guess because the nested for loops query the org.Hs.eg.db database at each iteration. Is there a quicker/simpler/smarter way to achieve the same result?
library(org.Hs.eg.db)
myTable <- read.table("tab_delimited_file.txt", header=TRUE, sep="\t", as.is=TRUE)
# Visit every cell of the table
for (i in seq_len(nrow(myTable))) {
  for (j in seq_len(ncol(myTable))) {
    # Alias -> first Entrez Gene ID
    repl <- org.Hs.egALIAS2EG[[myTable[i,j]]][1]
    if (!is.null(repl)) {
      # Entrez Gene ID -> first official gene symbol
      repl <- org.Hs.egSYMBOL[[repl]][1]
      if (!is.null(repl)) {
        myTable[i,j] <- repl
      }
    }
  }
}
write.table(myTable, file="new_tab_delimited_file", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)
I'm thinking of using one of the apply functions, but bear in mind that org.Hs.egALIAS2EG and org.Hs.egSYMBOL are objects, not functions.
Thank you!
This is the best I could come up with.
First write a function:
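alias2GS <- function(x) {
  # seq_along() also handles a zero-length x, unlike 1:length(x)
  for (i in seq_along(x)) {
    if (!is.na(x[i])) {
      # Alias -> first Entrez Gene ID
      repl <- org.Hs.egALIAS2EG[[x[i]]][1]
      if (!is.null(repl)) {
        # Entrez Gene ID -> first official gene symbol
        repl <- org.Hs.egSYMBOL[[repl]][1]
        if (!is.null(repl)) {
          x[i] <- repl
        }
      }
    }
  }
  return(x)
}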
And then call the function for each column of the data frame where the conversion is needed, like so:
df$GeneSymbols <- alias2GS(df$GeneSymbols)
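If the per-element loop is still too slow, a possible alternative is a vectorized lookup with match(), which does one lookup per column rather than one per cell. The following is only a minimal sketch under assumptions: aliasTable is a toy, made-up table (its alias and symbol names are hypothetical) and alias2GSvec is a name introduced here for illustration. With org.Hs.eg.db loaded, a real table could presumably be built by merging toTable(org.Hs.egALIAS2EG) and toTable(org.Hs.egSYMBOL) on their gene_id column, but I have not benchmarked that.

```r
# Toy stand-in for an alias -> symbol table (hypothetical rows, illustration only).
# Assumption: with org.Hs.eg.db loaded, a real table could be built with something like
#   merge(toTable(org.Hs.egALIAS2EG), toTable(org.Hs.egSYMBOL), by = "gene_id")
aliasTable <- data.frame(
  alias_symbol = c("ALIAS1", "ALIAS2", "SYMB_A", "ALIAS3"),
  symbol       = c("SYMB_A", "SYMB_A", "SYMB_A", "SYMB_B"),
  stringsAsFactors = FALSE
)

# Vectorized replacement: one match() call for the whole vector.
alias2GSvec <- function(x, tbl = aliasTable) {
  idx <- match(x, tbl$alias_symbol)   # NA where x is NA or not a known alias
  found <- !is.na(idx)
  x[found] <- tbl$symbol[idx[found]]  # unknown or missing values are left untouched
  x
}

# Small example data frame (made-up values)
df <- data.frame(
  GeneSymbols = c("ALIAS1", "UNKNOWN", NA, "ALIAS3"),
  Score       = c(1.5, 2.0, 0.3, 4.2),
  stringsAsFactors = FALSE
)

# Convert every column that needs it (here just one)
cols <- c("GeneSymbols")
df[cols] <- lapply(df[cols], alias2GSvec)
```

Note that when an alias maps to more than one gene, match() simply takes the first matching row of the table, which is not guaranteed to be the same choice as taking element [1] in the loop version.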