Search code examples
rdatedplyrlubridate

Matching two date fields using dplyr


So I have a dataframe as such:

ID   key  date1
001   02  2018-02-16
001   02  2018-02-19
001   03  2018-02-17
001   03  2018-02-22
001   04  2017-01-01
002   11  2019-12-21
002   12  2019-12-21
002   13  2019-12-22

And another dataframe (DF2)

ID   key  date2
001   02  2018-02-20
001   03  2018-03-22
002   13  2019-12-22
002   13  2019-12-21

So the task is conceptually straightforward:

I would like to find the previous date that exists in DF1 to each date in DF2.

What do I mean by that? In DF2 for example, we see a date of 2018-02-20 as well as a corresponding Key and ID. So I go to DF1 and find the matching ID and Key, this gives me two possibilities. I need the one that's the previous, so not after. Thus it would be 2018-02-19. I will eventually calcuate the number of days.

The final df should look like this:

ID   key  date2           date1   day_diff
001   02  2018-02-20 2018-02-19          1
001   03  2018-03-22 2018-02-22         28
002   13  2019-12-22 2019-12-22          0
002   13  2019-12-21         NA         NA

Again, we just need the date previous to the one that is in each row of DF2. This also needs to return NA if there is no previous date.


Solution

  • Does this work:

    library(dplyr)
    df1 %>% group_by(ID, key) %>% filter(date1 == max(date1)) %>% 
    fuzzyjoin::fuzzy_right_join(df2, by = c('ID' = 'ID', 'key' = 'key', 'date1' = 'date2'), match_fun = list(`==`, `==`, `<=`)) %>% 
    ungroup() %>% select('ID' = ID.y, 'key' = key.y, date2, date1) %>% mutate(day_diff = as.numeric(date2 - date1))
    # A tibble: 4 x 5
      ID    key   date2      date1      day_diff
      <chr> <chr> <date>     <date>        <dbl>
    1 001   02    2018-02-20 2018-02-19        1
    2 001   03    2018-03-22 2018-02-22       28
    3 002   13    2019-12-22 2019-12-22        0
    4 002   13    2019-12-21 NA               NA