Search code examples
rdplyrlubridate

How can I more efficiently assign a value based on data ranges?


I have some events with specific dates and date ranges that should be used to add a new value to the events based on which date range they occurred. I know I can accomplish this with the code below, but it does not scale well. It seems like there should be a way to vectorize it.

Here's the representative data

events
#>   event_name event_date
#> 1    event a 2012-05-23
#> 2    event b 2018-03-12
#> 3    event c 2019-08-17
#> 4    event d 2022-01-03

leader_tenure
#>    leader      start        end
#> 1 Manning 2012-04-09 2017-12-30
#> 2    Hale 2017-12-30 2019-11-09
#> 3 Whitney 2019-11-10 2022-05-30

Here's the working but inefficent and difficult-to-scale code

library(dplyr)
library(lubridate)

events <- data.frame(
  event_name = c("event a", "event b", "event c", "event d"),
  event_date = ymd(c("2012-05-23", "2018-03-12", "2019-08-17", "2022-01-03"
  ))
)

leader_tenure <- data.frame(
  leader = c("Manning", "Hale", "Whitney"),
  start = ymd(c("2012-04-09", "2017-12-30", "2019-11-10")),
  end = ymd(c("2017-12-30", "2019-11-09", "2022-05-30"))
)

events |>
  mutate(
    leader = case_when(
      event_date %within% interval(ymd("2012-04-09"), ymd("2017-12-30")) ~ "Manning",
      event_date %within% interval(ymd("2017-12-30"), ymd("2019-11-09")) ~ "Hale",
      event_date %within% interval(ymd("2019-11-10"), ymd("2022-05-30")) ~ "Whitney"
    )
  )
#>   event_name event_date  leader
#> 1    event a 2012-05-23 Manning
#> 2    event b 2018-03-12    Hale
#> 3    event c 2019-08-17    Hale
#> 4    event d 2022-01-03 Whitney

Solution

  • This is a good use case for a join, which is an efficient way to relate data between two tables. Since v1.1.0 in early 2023, dplyr has supported a variety of non-equi joins, including "overlap joins" like this. https://dplyr.tidyverse.org/reference/join_by.html

    You might want to think about what to do in case multiple leaders could be matched to a given event. The code below would add a row for each. But if that situation suggests a data quality problem, you could specify relationship = "one-to-one" within left_join to make sure it produces an error so you can fix it before you continue with other steps. Or if it's ok that multiple leaders could match, but you only want want to show one leader for each event, you could specify multiple = "first" or multiple = "any" to get only one for each event. (Or if there's a heuristic you want to apply to pick one, you could join to all of them, then use arrange+filter or slice_min/slice_max to pick the best.)

    events |>
      left_join(leader_tenure, join_by(event_date |> between(start, end)))
    
      event_name event_date  leader      start        end
    1    event a 2012-05-23 Manning 2012-04-09 2017-12-30
    2    event b 2018-03-12    Hale 2017-12-30 2019-11-09
    3    event c 2019-08-17    Hale 2017-12-30 2019-11-09
    4    event d 2022-01-03 Whitney 2019-11-10 2022-05-30