I have two data frames of different lengths. Both data frames contain dates. I would like to inner join the closest date column without going over. For example, if "2019-01-05" is the date and the two choices are "2019-01-06" and "2019-01-02", it joins with "2019-01-02". I tried using roll "inf" from data.table package but I am not getting the desired result. Here are 2 toy data frames that I want to join by date.
I would also like to group the join by the ticker column as well. In tidyverse, the code would be similar to:
df1 %>% group_by(ticker) %>% inner_join(df2, by = "Date") #but choose the closest date without going over.
df1 = data.frame(ticker = c("AAPL", "AAPL", "MSFT", "MSFT"), date = c("2019-01-06", "2019-02-06", "2019-01-06", "2019-05-02"))
df2 = data.frame(ticker = c("AAPL", "AAPL", "AAPL", "MSFT", "MSFT", "MSFT"), date = c("2019-01-03", "2019-01-07" , "2019-02-06", "2019-01-05", "2019-01-07", "2019-05-02"), randomVar = rnorm(6))
print(df1)
ticker date
1 AAPL 2019-01-06
2 AAPL 2019-02-06
3 MSFT 2019-01-06
4 MSFT 2019-05-02
print(df2)
ticker date randomVar
1 AAPL 2019-01-03 -0.5321493
2 AAPL 2019-01-07 -0.7909461
3 AAPL 2019-02-06 0.2121993
4 MSFT 2019-01-05 1.2336315
5 MSFT 2019-01-07 -0.2729354
6 MSFT 2019-05-02 -0.5349596
I would like to create df3 which is a join between df1 and df2. The date key column should only be the ones from the df1 date column.
In this case, our new df3 should look exactly like this.
print(df3)
ticker date randomVar
1 AAPL 2019-01-06 -0.5321493
2 AAPL 2019-02-06 0.2121993
3 MSFT 2019-01-06 1.2336315
4 MSFT 2019-05-02 -0.5349596
This can be done in SQL with the default SQLite backend using left join on ticker and on df2 date being less than or equal to the df1 date and then grouping over df1 and taking the max date from df2 of those joined to df1.
library(sqldf)
sqldf("select df1.*, max(df2.date), df2.randomVar from df1
left join df2 on df1.ticker = df2.ticker and df1.date >= df2.date
group by df1.rowid
order by df1.rowid")[-3]
giving:
ticker date randomVar
1 AAPL 2019-01-06 -0.5321493
2 AAPL 2019-02-06 0.2121993
3 MSFT 2019-01-06 1.2336315
4 MSFT 2019-05-02 -0.5349596
Inputs in reproducible form:
Lines1 <- "ticker date
1 AAPL 2019-01-06
2 AAPL 2019-02-06
3 MSFT 2019-01-06
4 MSFT 2019-05-02"
Lines2 <- "ticker date randomVar
1 AAPL 2019-01-03 -0.5321493
2 AAPL 2019-01-07 -0.7909461
3 AAPL 2019-02-06 0.2121993
4 MSFT 2019-01-05 1.2336315
5 MSFT 2019-01-07 -0.2729354
6 MSFT 2019-05-02 -0.5349596"
df1 <- read.table(text = Lines1, as.is = TRUE)
df2 <- read.table(text = Lines2, as.is = TRUE)