Search code examples
rdata.tabletidyverse

Join Two Data Frames By Closest Date Without Going Over In R


I have two data frames of different lengths. Both data frames contain dates. I would like to inner join the closest date column without going over. For example, if "2019-01-05" is the date and the two choices are "2019-01-06" and "2019-01-02", it joins with "2019-01-02". I tried using roll "inf" from data.table package but I am not getting the desired result. Here are 2 toy data frames that I want to join by date.

I would also like to group the join by the ticker column as well. In tidyverse, the code would be similar to:

df1 %>% group_by(ticker) %>% inner_join(df2, by = "Date") #but choose the closest date without going over.


df1 = data.frame(ticker = c("AAPL", "AAPL", "MSFT", "MSFT"), date = c("2019-01-06", "2019-02-06", "2019-01-06", "2019-05-02"))

df2 = data.frame(ticker = c("AAPL", "AAPL", "AAPL", "MSFT", "MSFT", "MSFT"), date = c("2019-01-03", "2019-01-07" , "2019-02-06", "2019-01-05", "2019-01-07", "2019-05-02"), randomVar = rnorm(6))

print(df1)
 ticker       date
1   AAPL 2019-01-06
2   AAPL 2019-02-06
3   MSFT 2019-01-06
4   MSFT 2019-05-02


print(df2)
 ticker       date  randomVar
1   AAPL 2019-01-03 -0.5321493
2   AAPL 2019-01-07 -0.7909461
3   AAPL 2019-02-06  0.2121993
4   MSFT 2019-01-05  1.2336315
5   MSFT 2019-01-07 -0.2729354
6   MSFT 2019-05-02 -0.5349596

I would like to create df3 which is a join between df1 and df2. The date key column should only be the ones from the df1 date column.

In this case, our new df3 should look exactly like this.

 print(df3)

 ticker       date   randomVar
1   AAPL 2019-01-06  -0.5321493
2   AAPL 2019-02-06   0.2121993
3   MSFT 2019-01-06   1.2336315
4   MSFT 2019-05-02   -0.5349596

Solution

  • This can be done in SQL with the default SQLite backend using left join on ticker and on df2 date being less than or equal to the df1 date and then grouping over df1 and taking the max date from df2 of those joined to df1.

    library(sqldf)
    sqldf("select df1.*, max(df2.date), df2.randomVar from df1 
      left join df2 on df1.ticker = df2.ticker and df1.date >= df2.date
      group by df1.rowid
      order by df1.rowid")[-3]
    

    giving:

      ticker       date  randomVar
    1   AAPL 2019-01-06 -0.5321493
    2   AAPL 2019-02-06  0.2121993
    3   MSFT 2019-01-06  1.2336315
    4   MSFT 2019-05-02 -0.5349596
    

    Note

    Inputs in reproducible form:

    Lines1 <- "ticker       date
    1   AAPL 2019-01-06
    2   AAPL 2019-02-06
    3   MSFT 2019-01-06
    4   MSFT 2019-05-02"
    
    
    Lines2 <- "ticker       date  randomVar
    1   AAPL 2019-01-03 -0.5321493
    2   AAPL 2019-01-07 -0.7909461
    3   AAPL 2019-02-06  0.2121993
    4   MSFT 2019-01-05  1.2336315
    5   MSFT 2019-01-07 -0.2729354
    6   MSFT 2019-05-02 -0.5349596"
    
    df1 <- read.table(text = Lines1, as.is = TRUE)
    df2 <- read.table(text = Lines2, as.is = TRUE)