Search code examples
rjoindplyrfuzzyjoin

R - Fuzzy Inner Join on two fields, matching to a date range


I'm fairly new to R, and have been sifting through other questions all morning trying to figure this out, but can't find anything related enough or my knowledge of R is not good enough to understand some of the suggested solutions to my problem.

I have two data frames, table A with a list of non-unique identifiers and a date, and table B with the same identifier field, and a start and end date outlining a 3 month date range. In my real data, I have 1.7m records in table A, and 1.6k records in table A (as well many other fields that i'll use for the final analysis). I am expecting the vast majority of records in table A to be unnecessary.

What I want to achieve is to join the two tables together, joining on the identifier, and then only joining if the date in table A falls inside the date range in table B. I want this as an inner join so I lose the unnecessary data.

Example tables:

a <- data.frame(numbera=c('1','2','3','1'),date1=as.Date(c('10/04/2021','21/06/2021','02/10/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','02/01/2022','17/05/2022'), format="%d/%m/%Y"))

b <- data.frame(numberb=c('1','2','2','3','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','02/09/2021','17/03/2022'), format="%d/%m/%Y"))

Expected result (we lose person 3 because the date falls outside the range, person 2 has two records because they had two entries in table b with corresponding dates):

c <- data.frame(numberb=c('1','2','2','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','17/03/2022'), format="%d/%m/%Y"),numbera=c('1','2','2','1'),date1=as.Date(c('10/04/2021','21/06/2021','21/06/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','21/09/2021','17/05/2022'), format="%d/%m/%Y"))

I've been basing my solution on this web page https://exploratory.io/note/exploratory/How-to-join-two-data-frames-with-date-ranges-moq8hEQ6, however my example adds in the requirement to also join the identifier.

My solution that logically makes sense to me based on my (limited) ability in R:

Joined <- fuzzy_inner_join(b, a, by = c("numberb"="numbera", "datex"="date1", "datex"="date2"),
                           match_fun = list("=", ">=", "<="))

However I get this error message:

Error in which(m) : argument to 'which' is not logical

Thank you in advance for any help here :)


Solution

  • data.table supports non-equi joins that are the fastest I've found. The syntax is a bit different from the tidyverse but I think it's worth it for these more complicated joins.

    a <- data.frame(numbera=c('1','2','3','1'),date1=as.Date(c('10/04/2021','21/06/2021','02/10/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','02/01/2022','17/05/2022'), format="%d/%m/%Y"))
    
    b <- data.frame(numberb=c('1','2','2','3','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','02/09/2021','17/03/2022'), format="%d/%m/%Y"))
    
    c <- data.frame(numberb=c('1','2','2','1'),datex=as.Date(c('16/05/2021','01/08/2021','03/08/2021','17/03/2022'), format="%d/%m/%Y"),numbera=c('1','2','2','1'),date1=as.Date(c('10/04/2021','21/06/2021','21/06/2021','17/02/2022'), format="%d/%m/%Y"),date2=as.Date(c('10/07/2021','21/09/2021','21/09/2021','17/05/2022'), format="%d/%m/%Y"))
    
    library(data.table)
    
    dt_a <- setDT(a) # change the data frame to a data table
    
    dt_b <- setDT(b) # change the data frame to a data table
    
    Joined <- dt_b[ # take table b
      dt_a, # join table a
      .(numberb, datex = x.datex, numbera, date1, date2), # selecting these columns
      on = .(numberb == numbera, datex>=date1, datex<=date2), # joining on these columns
      nomatch = NULL # remove non-matches for an inner join
    ] |> 
      setDF() # change it back to a data frame for comparison
    
    identical(c, Joined) # TRUE