Tags: r, date, data.table, lubridate, ranking

How to compute the ranking (of dates) by groups faster with data.table and lubridate?


I need to compute the ranks of dates by group. There are many small groups.

library(data.table)
library(lubridate)
library(microbenchmark)
set.seed(1)
NN <- 1000000
EE <- 10   
# Just an example.
todo <- data.table(id = paste0("ID", rep(1:NN, each = EE)), 
          val = dmy("1/1/1980") + sample(1:14000, NN*EE, replace = TRUE))
# I want to benchmark this:
todo[,ord := frank(val, ties.method="first"), by=id]  

To compare approaches you can try a smaller NN; the timing scales roughly linearly.

For NN = 1 million (10 million rows) it takes about 560 seconds.

Is there any way to do it faster?
I've been using lubridate but I can use any library you suggest.
In my real problem the number of rows within each ID is not constant.


Solution

  • I believe this is due to the overhead of calling frank once per group across a very large number of small groups (the memory allocation in the benchmark below hints at the bottleneck). Here is another option:

    DT1[order(id, val), ord := rowid(id)]
    

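The key observation: `order(id, val)` sorts the rows of each id by date, and `rowid(id)` then numbers the rows of each id in that sorted order, so assigning it back by reference reproduces `frank(val, ties.method = "first")`. (data.table's ordering is stable, so tied dates keep their original relative order, matching `ties.method = "first"`.) A minimal sketch on a toy table:

```r
library(data.table)

dt <- data.table(id  = c("A", "A", "A", "B", "B"),
                 val = as.IDate(c("2020-03-01", "2020-01-01", "2020-03-01",
                                  "2020-02-01", "2020-01-01")))

# Number rows within each id after sorting by (id, val),
# then assign back to the original row positions by reference
dt[order(id, val), ord := rowid(id)]

# Reference result: one frank() call per group
dt[, ord2 := frank(val, ties.method = "first"), by = id]

# identical(dt$ord, dt$ord2) is TRUE; the duplicated "2020-03-01"
# in group A gets ranks 2 and 3 in order of appearance
```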
    timing code:

    library(data.table)
    set.seed(1L)
    NN <- 1e6
    EE <- 10
    todo <- data.table(id=paste0("ID",rep(1:NN, each=EE)),
        val=as.IDate("1980-01-01") + sample(1:14000, NN*EE, replace=TRUE))
    DT0 <- copy(todo)
    DT1 <- copy(todo)
    
    bench::mark(
        todo[, ord := frank(val, ties.method="first"), by=id],
        DT0[, ord := rank(unclass(val), ties.method = "first"), by = id],
        DT1[order(id, val), ord := rowid(id)])
    
    all.equal(todo$ord, DT0$ord)  
    # [1] TRUE
    all.equal(todo$ord, DT1$ord)  
    # [1] TRUE
    

    timings:

      expression                                                             min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time 
      <bch:expr>                                                           <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis>
    1 todo[, `:=`(ord, frank(val, ties.method = "first")), by = id]        6.32m  6.32m   0.00264    15.7GB    0.177     1    67      6.32m <df[,~ <df[,~ <bch~
    2 DT0[, `:=`(ord, rank(unclass(val), ties.method = "first")), by = id] 1.12m  1.12m   0.0149     99.3MB    0.969     1    65      1.12m <df[,~ <df[,~ <bch~
    3 DT1[order(id, val), `:=`(ord, rowid(id))]                            7.85s  7.85s   0.127     236.8MB    0         1     0      7.85s <df[,~ <df[,~ <bch~
    
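An aside on row 2 of the table: the speedup over plain frank on the Date column presumably comes from `unclass()` stripping the Date class, so `rank()` compares plain day-count numbers instead of going through the slower class-aware comparison path once per small group. A quick illustration of what `unclass()` exposes:

```r
library(lubridate)

d <- dmy("1/1/1980") + c(5L, 1L, 3L)
unclass(d)  # plain day counts since 1970-01-01, no Date class attached
rank(unclass(d), ties.method = "first")
# [1] 3 1 2
```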

    It can be made even faster by removing id from order():

    DT1[order(val), ord := rowid(id)]
    
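This works because `rowid(id)` simply counts how many times each id has appeared so far in whatever row order it is given: after sorting by val alone, the k-th appearance of an id is its k-th smallest date, so sorting on id as well is unnecessary. (Again, this matches `ties.method = "first"` only because the ordering is stable.) A toy illustration:

```r
library(data.table)

dt <- data.table(id  = c("A", "B", "A", "B"),
                 val = as.IDate(c("2020-02-01", "2020-01-15",
                                  "2020-01-01", "2020-03-01")))

# Sort by val only; rowid(id) still counts appearances per id
dt[order(val), ord := rowid(id)]
dt$ord
# [1] 2 1 1 2
```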

    timing code:

    bench::mark(DT0[order(id, val), ord := rowid(id)], 
        DT1[order(val), ord := rowid(id)])
    all.equal(DT0$ord, DT1$ord)
    # [1] TRUE
    

    timings:

    # A tibble: 2 x 13
      expression                                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                    memory            time     gc              
      <bch:expr>                                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                    <list>            <list>   <list>          
    1 DT0[order(id, val), `:=`(ord, rowid(id))]    7.44s    7.44s     0.134     237MB        0     1     0      7.44s <df[,3] [10,000,000 x 3]> <df[,3] [15 x 3]> <bch:tm> <tibble [1 x 3]>
    2 DT1[order(val), `:=`(ord, rowid(id))]        4.66s    4.66s     0.215     237MB        0     1     0      4.66s <df[,3] [10,000,000 x 3]> <df[,3] [14 x 3]> <bch:tm> <tibble [1 x 3]>