Search code examples
rdata.tablenon-equi-join

How to join "out of range dates/nomatch" to nearest "date range" in non-equi R data.table joins?


I am trying to join two data tables with a data.table non-equi join. However, some of the dates in table 1, do not fall within a date range in table 2. What I want with these "nomatch", is to join them to the nearest date range. Lets says there are two ranges; 2002-01-01 to 2004-01-01 and another range; 2010-01-01 to 2012-01-01. The date 2009-01-01 will be out of range, however I want this date to be joined to the 2011-2012 range, as the year 2009 is closer to 2011, than the year 2004. I hope my question makes sense.

Example data

library(data.table)

# data.table with single dates
id <-  c("xxx","xxx","xxx","yyy","yyy","yyy","zzz","zzz","zzz")
date <- as.Date(c("2019-02-01","2020-05-01","2021-02-01","2020-01-01","2020-07-01","2020-08-01","2021-03-01","2022-10-01","2023-11-16"))
  
single_date_dt <- data.table(id,date)

# data.table with range dates
id <- c("xxx","xxx","yyy","zzz","zzz")
weight_kg <- c(10,13,20,45,65)
start_date <- as.Date(c("2019-01-01","2021-01-01","2020-01-01","2021-01-01","2023-01-01"))
end_date <- as.Date(c("2019-12-31","2021-12-31","2021-01-01","2021-12-31","2023-12-31"))

range_date_dt <- data.table(id,weight_kg,start_date,end_date) 
  
dates_joined <- range_date_dt[single_date_dt, on = .(id, start_date <= date, end_date >= date), nomatch = NA]

There are two NA's in the "dates_joined" data.table. The first NA should be "10" since 2020-05-01 is closer to 2019-12-31 than 2021-01-01, and the second NA should be 65, since 2022-10-01 is closer to 2023-01-01 than 2021-12-31.

The real dataset consist of several million observations with around 5000 "out of range" dates.

Thanks


Solution

  • I think we can safely assume the ranges won't overlap for the same id, so we can perform the rolling join using a melted range_date_dt:

    melt(range_date_dt, c("id", "weight_kg"), value.name = "date")[
      ,variable := NULL
    ][single_date_dt, on = .(id, date), roll = "nearest"]
    #>     id weight_kg       date
    #> 1: xxx        10 2019-02-01
    #> 2: xxx        10 2020-05-01
    #> 3: xxx        13 2021-02-01
    #> 4: yyy        20 2020-01-01
    #> 5: yyy        20 2020-07-01
    #> 6: yyy        20 2020-08-01
    #> 7: zzz        45 2021-03-01
    #> 8: zzz        65 2022-10-01
    #> 9: zzz        65 2023-11-16