Search code examples
rdatetimeposixlt

How to join two dataframes by nearest time-date?


I have 2 data sets, each containing a date-time value in POSIXlt format, and some other numeric and character variables.

I want to combine both data sets based on the date-time column. But the date stamps of both data sets do not match, so I need to combine them by nearest date (before or after). In my example, data value "e" from 2016-03-01 23:52:00 needs to be combined with "binH" at 2016-03-02 00:00:00, not "binG".

Is there a function that would allow me to combine my data sets by nearest date-time value, even if it is after?

I have found ways of combining dates to the next previous date using the cut() function, or the roll=Inf function in data.tables. But I couldn't get my timestamps into any format roll='nearest' would accept.

    >df1
    date1 value
    1 2016-03-01 17:52:00     a
    2 2016-03-01 18:01:30     b
    3 2016-03-01 18:05:00     c
    4 2016-03-01 20:42:30     d
    5 2016-03-01 23:52:00     e

    >df2
    date2 bin_name
    1 2016-03-01 17:00:00     binA
    2 2016-03-01 18:00:00     binB
    3 2016-03-01 19:00:00     binC
    4 2016-03-01 20:00:00     binD
    5 2016-03-01 21:00:00     binE
    6 2016-03-01 22:00:00     binF
    7 2016-03-01 23:00:00     binG
    8 2016-03-02 00:00:00     binH
    9 2016-03-02 01:00:00     binI

Solution

  • data.table should work for this (can you explain the error you're coming up against?), although it does tend to convert POSIXlt to POSIXct on its own (perhaps do that conversion on your datetime column manually to keep data.table happy). Also make sure you're setting the key column before using roll.

    (I've created my own example tables here to make my life that little bit easier. If you want to use dput on yours, I'm happy to update this example with your data):

    new <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00" ) ), data.new = c( "t","u","v" ) )
    head( new, 2 )
    
                      date data.new
    1: 2016-03-02 12:20:00        t
    2: 2016-03-07 12:20:00        u
    
    old <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2015-03-02 12:20:00" ) ), data.old = c( "a","b","c","d" ) )
    head( old, 2 )
    
    
                      date data.old
    1: 2016-03-02 12:20:00        a
    2: 2016-03-07 12:20:00        b
    
    setkey( new, date )
    setkey( old, date )
    
    combined <- new[ old, roll = "nearest" ]
    combined
    
                      date data.new data.old
    1: 2015-03-02 12:20:00        t        d
    2: 2016-03-02 12:20:00        t        a
    3: 2016-03-07 12:20:00        u        b
    4: 2016-04-02 12:20:00        v        c
    

    I've intentionally made the two tables different row lengths, in order to show how the rolling join deals with multiple matches. You can switch the way it joins with:

    combined <- old[ new, roll = "nearest" ]
    combined
    
                      date data.old data.new
    1: 2016-03-02 12:20:00        a        t
    2: 2016-03-07 12:20:00        b        u
    3: 2016-04-02 12:20:00        c        v