So I have a dataframe as such:
ID key date1
001 02 2018-02-16
001 02 2018-02-19
001 03 2018-02-17
001 03 2018-02-22
001 04 2017-01-01
002 11 2019-12-21
002 12 2019-12-21
002 13 2019-12-22
And another dataframe (DF2)
ID key date2
001 02 2018-02-20
001 03 2018-03-22
002 13 2019-12-22
002 13 2019-12-21
So the task is conceptually straightforward:
I would like to find the previous date that exists in DF1 to each date in DF2.
What do I mean by that? In DF2 for example, we see a date of 2018-02-20
as well as a corresponding Key and ID. So I go to DF1 and find the matching ID and Key, this gives me two possibilities. I need the one that's the previous, so not after. Thus it would be 2018-02-19
. I will eventually calcuate the number of days.
The final df should look like this:
ID key date2 date1 day_diff
001 02 2018-02-20 2018-02-19 1
001 03 2018-03-22 2018-02-22 28
002 13 2019-12-22 2019-12-22 0
002 13 2019-12-21 NA NA
Again, we just need the date previous to the one that is in each row of DF2. This also needs to return NA if there is no previous date.
Does this work:
library(dplyr)
df1 %>% group_by(ID, key) %>% filter(date1 == max(date1)) %>%
fuzzyjoin::fuzzy_right_join(df2, by = c('ID' = 'ID', 'key' = 'key', 'date1' = 'date2'), match_fun = list(`==`, `==`, `<=`)) %>%
ungroup() %>% select('ID' = ID.y, 'key' = key.y, date2, date1) %>% mutate(day_diff = as.numeric(date2 - date1))
# A tibble: 4 x 5
ID key date2 date1 day_diff
<chr> <chr> <date> <date> <dbl>
1 001 02 2018-02-20 2018-02-19 1
2 001 03 2018-03-22 2018-02-22 28
3 002 13 2019-12-22 2019-12-22 0
4 002 13 2019-12-21 NA NA