Search code examples
rdplyrsubset

How to subset one dataframe based on the number of days from a second dataframe in R


I have been working on this and looking for a solution but can't find anything that gets me exactly what I want. I have two data frames. The first data frame is structured like this

Name <- c("Doe, John", "Doe, John", "Smith, John")
ID <- c("123456", "123456", "345678")
Collection <- c("2021-01-03", "2022-05-01", "2022-06-14")
df1<-data.frame(Name, ID, Collection)

My second dataframe is structed like this

Number<- c("M123", "M456", "M367")
ID <- c("123456", "123456", "345678")
Complete_Date <- c("2021-01-05", "2022-06-01", "2022-06-12")

I would like to remove observations in df1 that based on "ID" and "Collection" do not occur within 7 days (+/-) from any observation matching the same "ID" in df2

So ideally my output from the two examples would be in a new dataframe (df3) and look like this since my second observation (ID: 123456) in df1 is not within 7 days of of either Complete_Date with that ID in df2

Name              ID        Collection     
Doe, John         123456    2021-01-03
Smith, John       345678    2022-06-14

Solution

  • The typical way to do this would be using a "non-equi join," which is not currently supported in dplyr but is in the dev version. We can get around that (with some loss of efficiency) using a cartesian join that first connects each ID with all its matches in the 2nd table, then filters to just the ones within 7 days.


    Prepare fake data

    Name <- c("Doe, John", "Doe, John", "Smith, John")
    ID <- c("123456", "123456", "345678")
    Collection <- c("2021-01-03", "2022-05-01", "2022-06-14")
    df1<-data.frame(Name, ID, Collection)
    
    Number<- c("M123", "M456", "M367")
    ID <- c("123456", "123456", "345678")
    Complete_Date <- c("2021-01-05", "2022-06-01", "2022-06-12")
    df2<-data.frame(Number, ID, Complete_Date)
    
    df1$Collection = as.Date(df1$Collection)
    df2$Complete_Date = as.Date(df2$Complete_Date)
    

    Cartesian join and filter using current CRAN dplyr 1.0.10:

    df1 %>%
      left_join(df2, by = "ID") %>%
      group_by(Name, ID) %>%
      filter(abs(Collection - Complete_Date) <= 7) %>%
      ungroup() %>%
      distinct(Name, ID, Collection)
    

    In the current development version of dplyr (I'm using 1.0.99.9000), there's support for non-equi joins like this, which I expect will be more efficient if each ID has many Complete_Date matches.

    # devtools::install_github("tidyverse/dplyr")
    df1 %>%
      mutate(Collection_early = Collection - 7, 
             Collection_late  = Collection + 7) %>%
      left_join(df2, join_by(ID,
                             Collection_early >= Complete_Date,
                             Collection_late  <= Complete_Date)) %>%
      distinct(Name, ID, Collection)