Search code examples
rdata.table

Efficient way of row binding time series in a data.table, with correctly sorted timestamps


Is there a more efficient way to row bind (or an efficient merge) two or more massive time series with data table? The time series have some different columns, so I use fill = TRUE.

I want all the rows in each time series to appear in the final data.table. I can do it below, but the time series stamps are not ordered in dt3 below. I have to create dt4 to get the ordered stamps.

I wonder if there is a more efficient way of doing a kind of rbind/time series merge in data.table?

library(data.table)

tm <- seq(as.POSIXct("2018-05-12 00:00"), as.POSIXct("2018-05-14"), by = "hours")
dt <- data.table(time = tm, x = seq(1, length(tm), by = 1))

set.seed(1)

dt2 <- data.table(time = tm[sample(length(tm), size = 8)] + rnorm(n = 8, 0, 60),
                 y = rnorm(8))

# Can a one liner here get me the output in `dt4` with some kind of row bind? 
#  Is there a way to do a row bind here instead that avoids the creation of a new object dt4 that takes the sorted rows?

dt3 <- rbind(dt, dt2, fill = TRUE)

dt4 <- dt3[order(time)]

                      tail(dt4, 20)
#                   time  x           y
# 1: 2018-05-13 08:00:00 33          NA
# 2: 2018-05-13 09:00:00 34          NA
# 3: 2018-05-13 10:00:00 35          NA
# 4: 2018-05-13 11:00:00 36          NA
# 5: 2018-05-13 12:00:00 37          NA
# 6: 2018-05-13 13:00:00 38          NA
# 7: 2018-05-13 14:00:00 39          NA
# 8: 2018-05-13 14:59:41 NA  0.94383621
# 9: 2018-05-13 15:00:00 40          NA
# 10: 2018-05-13 16:00:00 41          NA
# 11: 2018-05-13 16:01:30 NA  0.82122120
# 12: 2018-05-13 17:00:00 42          NA
# 13: 2018-05-13 17:00:44 NA -0.04493361
# 14: 2018-05-13 18:00:00 43          NA
# 15: 2018-05-13 19:00:00 44          NA
# 16: 2018-05-13 20:00:00 45          NA
# 17: 2018-05-13 21:00:00 46          NA
# 18: 2018-05-13 22:00:00 47          NA
# 19: 2018-05-13 23:00:00 48          NA
# 20: 2018-05-14 00:00:00 49          NA

Solution

  • If you have the time columns set as keys

    setkey(dt, time)
    setkey(dt2, time)
    

    Then you can use merge.data.table

    merge(dt,dt2,all=TRUE)
    

    Note, if the time series are already known to be sorted (which dt is, but dt2 is not), you can speed up a bit more by just setting the 'sorted' attribute of the data.tables, rather than calling setkey.

    setattr(dt, "sorted", "time")