I'm kind of new to data.tables and I have a table containing DNA genomic coordinates like this:
chrom pause strand coverage
1: 1 3025794 + 1
2: 1 3102057 + 2
3: 1 3102058 + 2
4: 1 3102078 + 1
5: 1 3108840 - 1
6: 1 3133041 + 1
I wrote a custom function that I want to apply to each row of my circa 2 million rows table, it uses GenomicFeatures' mapToTranscripts to retrieve two related values in form of a string and a new coordinate. I want to add them to my table in two new columns, like this:
chrom pause strand coverage transcriptID CDS
1: 1 3025794 + 1 ENSMUST00000116652 196
2: 1 3102057 + 2 ENSMUST00000116652 35
3: 1 3102058 + 2 ENSMUST00000156816 888
4: 1 3102078 + 1 ENSMUST00000156816 883
5: 1 3108840 - 1 ENSMUST00000156816 882
6: 1 3133041 + 1 ENSMUST00000156816 880
The function is the following:
get_feature <- function(dt){
coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand)
hit <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE)
tx_id <- tx_names[as.character(seqnames(hit))]
cds_coordinate <- sapply(ranges(hit), '[[', 1)
if(length(tx_id) == 0 || length(cds_coordinate) == 0) {
out <- list('NaN', 0)
} else {
out <- list(tx_id, cds_coordinate)
}
return(out)
}
Then, I do:
counts[, c("transcriptID", "CDS"):=get_feature(.SD), by = .I]
And I get this error, indicating that the function is returning two lists of shorter length than the original table, instead of one new element per row:
Warning messages:
1: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"), ... :
Supplied 1112452 items to be assigned to 1886614 items of column 'transcriptID' (recycled leaving remainder of 774162 items).
2: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"), ... :
Supplied 1112452 items to be assigned to 1886614 items of column 'CDS' (recycled leaving remainder of 774162 items).
I assumed that using the .I operator would apply the function on a row basis and return one value per row. I also made sure the function was not returning empty values using the if statement.
Then I tried this mock version of the function:
get_feature <- function(dt) {
return('I should be returned once for each row')
}
And called it like this:
new.table <- counts[, get_feature(.SD), by = .I]
It makes a 1 row data table, instead of one the original length. So I concluded that my function, or maybe the way I'm calling it, is collapsing the elements of the resulting vector somehow. What am I doing wrong?
Update (with solution): As @StatLearner pointed out, it is explained in this answer that, as explained in ?data.table
, .I
is only intended for use in j
(as in DT[i,j,by=]
). Therefore, by=.I
is equivalent to by=NULL
and the proper syntax is by=1:nrow(dt)
in order to group by row number and apply the function row-wise.
Unfortunately, for my particular case, this is utterly inefficient and I calculated an execution time of 20 seconds for 100 rows. For my 36 million row dataset that takes 3 months to complete.
In my case, I had to give up and use the mapToTranscripts
function on the entire table like this, which takes a couple of seconds and was obviously the intended use.
get_features <- function(dt){
coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) # define coordinate
hits <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) # map it to a transcript
tx_hit <- as.character(seqnames(hits)) # get transcript number
tx_id <- tx_names[tx_hit] # get transcript name from translation table
return(data.table('transcriptID'= tx_id,
'CDS_coordinate' = start(hits))
}
density <- counts[, get_features(.SD)]
Then map back to the genome using mapFromTranscripts
from GenomicFeatures
package so I could use a data.tables
join to retrieve information from the original table, which was the intended purpose of what I was trying to do.
The way I do it when I need to apply a function for each row in a data.table is grouping it by row number:
counts[, get_feature(.SD), by = 1:nrow(counts)]
As explained in this answer,.I
is not intended for using in by
since it should return the sequence of row indices that is produced by grouping. The reason why by = .I
doesn't throw an error is that data.table creates object .I
equals NULL
in data.table namespace, hence by = .I
is equivalent to by = NULL
.
Note that using by=1:nrow(dt)
groups by row number and allows your function to access only a single row from data.table:
require(data.table)
counts <- data.table(chrom = sample.int(10, size = 100, replace = TRUE),
pause = sample((3 * 10^6):(3.2 * 10^6), size = 100),
strand = sample(c('-','+'), size = 100, replace = TRUE),
coverage = sample.int(3, size = 100, replace = TRUE))
get_feature <- function(dt){
coordinate <- data.frame(dt$chrom, dt$pause, dt$strand)
rowNum <- nrow(coordinate)
return(list(text = 'Number of rows in dt', rowNum = rowNum))
}
counts[, get_feature(.SD), by = 1:nrow(counts)]
will produce a data.table with the same number of rows as in counts
, but coordinate
will contain just a single row from counts
nrow text rowNum
1: 1 Number of rows in dt 1
2: 2 Number of rows in dt 1
3: 3 Number of rows in dt 1
4: 4 Number of rows in dt 1
5: 5 Number of rows in dt 1
while by = NULL
will supply the entire data.table to the function:
counts[, get_feature(.SD), by = NULL]
text rowNum
1: Number of rows in dt 100
which is the intended way for by
to work.