Search code examples
rmergestrptime

How to merge two datasets based on similar but not exact time variable written in string. using R?


I have two datasets I would like to merge that looks like this

df1
ID|date    |       time|
 1 04/06/21   "05:02:06"
 2 05/03/21   "04:12:11"
 3 02/02/20   "03:02:10"
 4 09/09/20   "09:12:14"
 5 02/02/21   "15:18:20" 
 6 04/04/21   "14:00:00"  

df2
2ID|date    |       time|
 1 04/06/21   "05:12:06"
 2 05/03/21   "04:08:11"
 3 02/02/20   "03:09:10"
 4 09/09/20   "09:12:14"
 5 02/02/21   "15:18:20" 
 6 04/04/21   "15:00:00"

Normally If I run a script based on exact match

df3 <- df2 %>% left_join(df1, by=c("incident_date","incident_time"))

I will get

ID|    date|time    |2ID
1  04/06/21 "05:02:06" 
2  05/03/21 "04:12:11"
3  02/02/20 "03:02:10"
4  09/09/20 "09:12:14" 4
5  02/02/21 "15:18:20" 5
6  12/14/22 "14:00:00"

Note I will only match two out of four however the remaining three out of four variables that are close in time I would like to match. I would like the script to do a give or take by 45 minutes which would end up looking like this

ID|    date|time    |2ID
1  04/06/21 "05:02:06" 1
2  05/03/21 "04:12:11" 2
3  02/02/20 "03:02:10" 3
4  09/09/20 "09:12:14" 4
5  02/02/21 "15:18:20" 5
6  12/14/22 "14:00:00"

I attempted to do something like this based off an earlier stack overflow question in regards to date, but it was not able to work. Would anyone know how to do this

Source: Merge based on similar but not exact dates

df3< - df1%>%
left_join(df2, by=c("incident_date"), suffix 
= c(".df1", ".df2")) %>%
filter(abs({strptime(df1$incident_time,format="%H:%M:%S")}.df1 - 
{strptime(df2$incident_time,format="%H:%M:%S")}.df2) <= strptime("00:45:00",format="%H:%M:%S))

Solution

  • I think the fuzzyjoin package is best for this.

    I'll add a $tm (POSIXct) column to both frames, since that is needed to get a clear "calculated difference" (in seconds).

    df1$tm <- as.POSIXct(paste(df1$date, df1$time), format="%m/%d/%Y %H:%M:%S")
    df2$tm <- as.POSIXct(paste(df2$date, df2$time), format="%m/%d/%Y %H:%M:%S")
    fuzzyjoin::difference_left_join(df1, df2, by = "tm", max_dist = 45*60)
    #   ID.x   date.x   time.x                tm.x ID.y   date.y   time.y                tm.y
    # 1    1 04/06/21 05:02:06 0021-04-06 05:02:06    1 04/06/21 05:12:06 0021-04-06 05:12:06
    # 2    2 05/03/21 04:12:11 0021-05-03 04:12:11    2 05/03/21 04:08:11 0021-05-03 04:08:11
    # 3    3 02/02/20 03:02:10 0020-02-02 03:02:10    3 02/02/20 03:09:10 0020-02-02 03:09:10
    # 4    4 09/09/20 09:12:14 0020-09-09 09:12:14    4 09/09/20 09:12:14 0020-09-09 09:12:14
    # 5    5 02/02/21 15:18:20 0021-02-02 15:18:20    5 02/02/21 15:18:20 0021-02-02 15:18:20
    # 6    6 04/04/21 14:00:00 0021-04-04 14:00:00   NA     <NA>     <NA>                <NA>
    

    Obviously there's a bunch of name cleanup required, how about this:

    fuzzyjoin::difference_left_join(df1, df2[,c("ID","tm")], by = "tm", max_dist = 45*60) %>%
      select(ID = ID.x, date, time, ID2 = ID.y)
    #   ID     date     time ID2
    # 1  1 04/06/21 05:02:06   1
    # 2  2 05/03/21 04:12:11   2
    # 3  3 02/02/20 03:02:10   3
    # 4  4 09/09/20 09:12:14   4
    # 5  5 02/02/21 15:18:20   5
    # 6  6 04/04/21 14:00:00  NA
    

    Note: it is possible to find multiple matches (if multiple incidents are within 45 minutes), so you may need to add a grouped-filter:

    ... %>%
      group_by(ID.x) %>%
      filter(which.min(abs(tm.x - tm.y)))
    

    (needs to be done before my renaming and removal of the tm.* fields)


    Data

    df1 <- structure(list(ID = 1:6, date = c("04/06/21", "05/03/21", "02/02/20", "09/09/20", "02/02/21", "04/04/21"), time = c("05:02:06", "04:12:11", "03:02:10", "09:12:14", "15:18:20", "14:00:00")), class = "data.frame", row.names = c(NA, -6L))
    df2 <- structure(list(ID = 1:6, date = c("04/06/21", "05/03/21", "02/02/20", "09/09/20", "02/02/21", "04/04/21"), time = c("05:12:06", "04:08:11", "03:09:10", "09:12:14", "15:18:20", "15:00:00")), class = "data.frame", row.names = c(NA, -6L))