Search code examples
rdata.tablerecycle

data.table efficient recycling V2


This is a follow-up to this question : data.table efficient recycling

The difference here is that the number of future years for each line is not necessarily the same ..

I frequently use recycling in data.table, for exemple when I need to make projections future years. I repeat my original data fro each future year.

This can lead to something like that :

library(data.table)
dt <- data.table(1:500000, 500000:1, rpois(500000, 240))
dt2 <- dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt) ]

But I often have to deal with millions of lines, and far more columns than in this toy exemple. The time increases .. Try this :

library(data.table)
dt <- data.table(1:5000000, 5000000:1, rpois(5000000, 240))
dt2 <- dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt) ]

My question is : is there a more efficient way to achieve this purpose ?

Thanks for any help !


Solution

  • Here is slightly improved version of the other answer.

    • using non-default values to unlist
    • rep.int rather than rep
    • seq_len rather than :
    • setDT instead of data.table()
    • even better with sequence function suggested by @Cole
    • and further minor improvement with internal vecseq

    Together it seems to make difference.

    Timings...

    library(data.table)
    f0 = function(dt) {
      dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt) ]
    }
    f1 = function(dt) {
      dt2 <- data.table(
        rep(dt$V1, dt$V3),
        rep(dt$V2, dt$V3),
        rep(dt$V3, dt$V3),
        unlist(lapply(dt$V3, function(x){1:x}))
      )
      dt2
    }
    f2 = function(dt) {
      dt2 = list(
        V1 = rep.int(dt$V1, dt$V3),
        V2 = rep.int(dt$V2, dt$V3),
        V3 = rep.int(dt$V3, dt$V3),
        year = unlist(lapply(dt$V3, seq_len), recursive=FALSE, use.names=FALSE)
      )
      setDT(dt2)
      dt2
    }
    f3 = function(dt) {
      ## even better with sequence function suggested by @Cole
      dt2 = list(
        V1 = rep.int(dt$V1, dt$V3),
        V2 = rep.int(dt$V2, dt$V3),
        V3 = rep.int(dt$V3, dt$V3),
        year = sequence(dt$V3)
      )
      setDT(dt2)
      dt2
    }
    f4 = function(dt) {
      dt[, c(lapply(.SD, rep.int, V3), year = .(sequence(V3)))]
    }
    f5 = function(dt) {
      dt2 = list(
        V1 = rep.int(dt$V1, dt$V3),
        V2 = rep.int(dt$V2, dt$V3),
        V3 = rep.int(dt$V3, dt$V3),
        year = data.table:::vecseq(rep.int(1L,length(dt$V3)), dt$V3, NULL)
      )
      setDT(dt2)
      dt2
    }
    

    On a "big" data

    dt <- data.table(1:5000000, 5000000:1, rpois(5000000, 240))
    system.time(f0(dt))
    #   user  system elapsed 
    # 22.100  18.914  40.449 
    system.time(f1(dt))
    #   user  system elapsed 
    # 35.866  15.607  51.475 
    system.time(f2(dt))
    #   user  system elapsed 
    # 22.922   6.839  29.760 
    system.time(f3(dt))
    #   user  system elapsed 
    #  6.509   6.723  13.233 
    system.time(f4(dt))
    #   user  system elapsed 
    # 12.140  14.114  26.254 
    system.time(f5(dt))
    #   user  system elapsed 
    #  6.448   4.057  10.506 
    

    Anyway, you should try to improve your processes that you are running on expanded dataset because maybe you don't have to expand that in the first place.

    For example, in frollmean function there is an argument adaptive which makes it possible to calculate rolling mean on a variable length window, where normally to compute that one would need to expand data first. V3 in your data reminds a lot a length of a window for adaptive moving average.