Search code examples
rdata.tablegenomicranges

Applying custom function to data.table by row returns incorrect amount of values


I'm kind of new to data.tables and I have a table containing DNA genomic coordinates like this:

       chrom   pause strand coverage
    1:     1 3025794      +        1
    2:     1 3102057      +        2
    3:     1 3102058      +        2
    4:     1 3102078      +        1
    5:     1 3108840      -        1
    6:     1 3133041      +        1

I wrote a custom function that I want to apply to each row of my circa 2 million rows table, it uses GenomicFeatures' mapToTranscripts to retrieve two related values in form of a string and a new coordinate. I want to add them to my table in two new columns, like this:

       chrom   pause strand coverage       transcriptID CDS
    1:     1 3025794      +        1 ENSMUST00000116652 196
    2:     1 3102057      +        2 ENSMUST00000116652  35
    3:     1 3102058      +        2 ENSMUST00000156816 888
    4:     1 3102078      +        1 ENSMUST00000156816 883
    5:     1 3108840      -        1 ENSMUST00000156816 882
    6:     1 3133041      +        1 ENSMUST00000156816 880

The function is the following:

    get_feature <- function(dt){

      coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) 
      hit <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) 
      tx_id <- tx_names[as.character(seqnames(hit))] 
      cds_coordinate <- sapply(ranges(hit), '[[', 1)

      if(length(tx_id) == 0 || length(cds_coordinate) == 0) {  
        out <- list('NaN', 0)
      } else {
        out <- list(tx_id, cds_coordinate)
      }

      return(out)
    } 

Then, I do:

    counts[, c("transcriptID", "CDS"):=get_feature(.SD), by = .I] 

And I get this error, indicating that the function is returning two lists of shorter length than the original table, instead of one new element per row:

Warning messages:
    1: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
      Supplied 1112452 items to be assigned to 1886614 items of column 'transcriptID' (recycled leaving remainder of 774162 items).
    2: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
      Supplied 1112452 items to be assigned to 1886614 items of column 'CDS' (recycled leaving remainder of 774162 items).

I assumed that using the .I operator would apply the function on a row basis and return one value per row. I also made sure the function was not returning empty values using the if statement.

Then I tried this mock version of the function:

    get_feature <- function(dt) {

      return('I should be returned once for each row')

    }

And called it like this:

    new.table <- counts[, get_feature(.SD), by = .I] 

It makes a 1 row data table, instead of one the original length. So I concluded that my function, or maybe the way I'm calling it, is collapsing the elements of the resulting vector somehow. What am I doing wrong?

Update (with solution): As @StatLearner pointed out, it is explained in this answer that, as explained in ?data.table, .I is only intended for use in j (as in DT[i,j,by=]). Therefore, by=.I is equivalent to by=NULL and the proper syntax is by=1:nrow(dt) in order to group by row number and apply the function row-wise.

Unfortunately, for my particular case, this is utterly inefficient and I calculated an execution time of 20 seconds for 100 rows. For my 36 million row dataset that takes 3 months to complete.

In my case, I had to give up and use the mapToTranscripts function on the entire table like this, which takes a couple of seconds and was obviously the intended use.

    get_features <- function(dt){
      coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) # define coordinate
      hits <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) # map it to a transcript
      tx_hit <- as.character(seqnames(hits)) # get transcript number
      tx_id <- tx_names[tx_hit] # get transcript name from translation table

      return(data.table('transcriptID'= tx_id, 
                       'CDS_coordinate' =  start(hits))
    }

     density <- counts[, get_features(.SD)]

Then map back to the genome using mapFromTranscripts from GenomicFeatures package so I could use a data.tables join to retrieve information from the original table, which was the intended purpose of what I was trying to do.


Solution

  • The way I do it when I need to apply a function for each row in a data.table is grouping it by row number:

    counts[, get_feature(.SD), by = 1:nrow(counts)]
    

    As explained in this answer,.I is not intended for using in by since it should return the sequence of row indices that is produced by grouping. The reason why by = .I doesn't throw an error is that data.table creates object .I equals NULL in data.table namespace, hence by = .I is equivalent to by = NULL.

    Note that using by=1:nrow(dt) groups by row number and allows your function to access only a single row from data.table:

    require(data.table)
    counts <- data.table(chrom = sample.int(10, size = 100, replace = TRUE),
                         pause = sample((3 * 10^6):(3.2 * 10^6), size = 100), 
                         strand = sample(c('-','+'), size = 100, replace = TRUE),
                         coverage = sample.int(3, size = 100, replace = TRUE))
    
    get_feature <- function(dt){
        coordinate <- data.frame(dt$chrom, dt$pause, dt$strand)
        rowNum <- nrow(coordinate)
        return(list(text = 'Number of rows in dt', rowNum = rowNum))  
    }
    
    counts[, get_feature(.SD), by = 1:nrow(counts)]
    

    will produce a data.table with the same number of rows as in counts, but coordinate will contain just a single row from counts

       nrow                 text rowNum
    1:    1 Number of rows in dt      1
    2:    2 Number of rows in dt      1
    3:    3 Number of rows in dt      1
    4:    4 Number of rows in dt      1
    5:    5 Number of rows in dt      1
    

    while by = NULL will supply the entire data.table to the function:

    counts[, get_feature(.SD), by = NULL]
    
                       text rowNum
    1: Number of rows in dt    100
    

    which is the intended way for by to work.