How to subset one dataframe based on the number of days from a second dataframe in R

I have been working on this and looking for a solution but can't find anything that gets me exactly what I want. I have two data frames. The first data frame is structured like this

Name <- c("Doe, John", "Doe, John", "Smith, John")
ID <- c("123456", "123456", "345678")
Collection <- c("2021-01-03", "2022-05-01", "2022-06-14")
df1<-data.frame(Name, ID, Collection)

My second dataframe is structed like this

Number<- c("M123", "M456", "M367")
ID <- c("123456", "123456", "345678")
Complete_Date <- c("2021-01-05", "2022-06-01", "2022-06-12")

I would like to remove observations in df1 that based on "ID" and "Collection" do not occur within 7 days (+/-) from any observation matching the same "ID" in df2

So ideally my output from the two examples would be in a new dataframe (df3) and look like this since my second observation (ID: 123456) in df1 is not within 7 days of of either Complete_Date with that ID in df2

Name              ID        Collection     
Doe, John         123456    2021-01-03
Smith, John       345678    2022-06-14

Solution

The typical way to do this would be using a "non-equi join," which is not currently supported in dplyr but is in the dev version. We can get around that (with some loss of efficiency) using a cartesian join that first connects each ID with all its matches in the 2nd table, then filters to just the ones within 7 days.

Prepare fake data

Name <- c("Doe, John", "Doe, John", "Smith, John")
ID <- c("123456", "123456", "345678")
Collection <- c("2021-01-03", "2022-05-01", "2022-06-14")
df1<-data.frame(Name, ID, Collection)

Number<- c("M123", "M456", "M367")
ID <- c("123456", "123456", "345678")
Complete_Date <- c("2021-01-05", "2022-06-01", "2022-06-12")
df2<-data.frame(Number, ID, Complete_Date)

df1$Collection = as.Date(df1$Collection)
df2$Complete_Date = as.Date(df2$Complete_Date)

Cartesian join and filter using current CRAN dplyr 1.0.10:

df1 %>%
  left_join(df2, by = "ID") %>%
  group_by(Name, ID) %>%
  filter(abs(Collection - Complete_Date) <= 7) %>%
  ungroup() %>%
  distinct(Name, ID, Collection)

In the current development version of dplyr (I'm using 1.0.99.9000), there's support for non-equi joins like this, which I expect will be more efficient if each ID has many Complete_Date matches.

# devtools::install_github("tidyverse/dplyr")
df1 %>%
  mutate(Collection_early = Collection - 7, 
         Collection_late  = Collection + 7) %>%
  left_join(df2, join_by(ID,
                         Collection_early >= Complete_Date,
                         Collection_late  <= Complete_Date)) %>%
  distinct(Name, ID, Collection)