I have two datasets I would like to merge that looks like this
df1
ID|date | time|
1 04/06/21 "05:02:06"
2 05/03/21 "04:12:11"
3 02/02/20 "03:02:10"
4 09/09/20 "09:12:14"
5 02/02/21 "15:18:20"
6 04/04/21 "14:00:00"
df2
2ID|date | time|
1 04/06/21 "05:12:06"
2 05/03/21 "04:08:11"
3 02/02/20 "03:09:10"
4 09/09/20 "09:12:14"
5 02/02/21 "15:18:20"
6 04/04/21 "15:00:00"
Normally If I run a script based on exact match
df3 <- df2 %>% left_join(df1, by=c("incident_date","incident_time"))
I will get
ID| date|time |2ID
1 04/06/21 "05:02:06"
2 05/03/21 "04:12:11"
3 02/02/20 "03:02:10"
4 09/09/20 "09:12:14" 4
5 02/02/21 "15:18:20" 5
6 12/14/22 "14:00:00"
Note I will only match two out of four however the remaining three out of four variables that are close in time I would like to match. I would like the script to do a give or take by 45 minutes which would end up looking like this
ID| date|time |2ID
1 04/06/21 "05:02:06" 1
2 05/03/21 "04:12:11" 2
3 02/02/20 "03:02:10" 3
4 09/09/20 "09:12:14" 4
5 02/02/21 "15:18:20" 5
6 12/14/22 "14:00:00"
I attempted to do something like this based off an earlier stack overflow question in regards to date, but it was not able to work. Would anyone know how to do this
Source: Merge based on similar but not exact dates
df3< - df1%>%
left_join(df2, by=c("incident_date"), suffix
= c(".df1", ".df2")) %>%
filter(abs({strptime(df1$incident_time,format="%H:%M:%S")}.df1 -
{strptime(df2$incident_time,format="%H:%M:%S")}.df2) <= strptime("00:45:00",format="%H:%M:%S))
I think the fuzzyjoin
package is best for this.
I'll add a $tm
(POSIXct
) column to both frames, since that is needed to get a clear "calculated difference" (in seconds).
df1$tm <- as.POSIXct(paste(df1$date, df1$time), format="%m/%d/%Y %H:%M:%S")
df2$tm <- as.POSIXct(paste(df2$date, df2$time), format="%m/%d/%Y %H:%M:%S")
fuzzyjoin::difference_left_join(df1, df2, by = "tm", max_dist = 45*60)
# ID.x date.x time.x tm.x ID.y date.y time.y tm.y
# 1 1 04/06/21 05:02:06 0021-04-06 05:02:06 1 04/06/21 05:12:06 0021-04-06 05:12:06
# 2 2 05/03/21 04:12:11 0021-05-03 04:12:11 2 05/03/21 04:08:11 0021-05-03 04:08:11
# 3 3 02/02/20 03:02:10 0020-02-02 03:02:10 3 02/02/20 03:09:10 0020-02-02 03:09:10
# 4 4 09/09/20 09:12:14 0020-09-09 09:12:14 4 09/09/20 09:12:14 0020-09-09 09:12:14
# 5 5 02/02/21 15:18:20 0021-02-02 15:18:20 5 02/02/21 15:18:20 0021-02-02 15:18:20
# 6 6 04/04/21 14:00:00 0021-04-04 14:00:00 NA <NA> <NA> <NA>
Obviously there's a bunch of name cleanup required, how about this:
fuzzyjoin::difference_left_join(df1, df2[,c("ID","tm")], by = "tm", max_dist = 45*60) %>%
select(ID = ID.x, date, time, ID2 = ID.y)
# ID date time ID2
# 1 1 04/06/21 05:02:06 1
# 2 2 05/03/21 04:12:11 2
# 3 3 02/02/20 03:02:10 3
# 4 4 09/09/20 09:12:14 4
# 5 5 02/02/21 15:18:20 5
# 6 6 04/04/21 14:00:00 NA
Note: it is possible to find multiple matches (if multiple incidents are within 45 minutes), so you may need to add a grouped-filter:
... %>%
group_by(ID.x) %>%
filter(which.min(abs(tm.x - tm.y)))
(needs to be done before my renaming and removal of the tm.*
fields)
Data
df1 <- structure(list(ID = 1:6, date = c("04/06/21", "05/03/21", "02/02/20", "09/09/20", "02/02/21", "04/04/21"), time = c("05:02:06", "04:12:11", "03:02:10", "09:12:14", "15:18:20", "14:00:00")), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = 1:6, date = c("04/06/21", "05/03/21", "02/02/20", "09/09/20", "02/02/21", "04/04/21"), time = c("05:12:06", "04:08:11", "03:09:10", "09:12:14", "15:18:20", "15:00:00")), class = "data.frame", row.names = c(NA, -6L))