Search code examples
rdatetimelubridate

Is there a way in R to subset or fragment lubridate interval objects to count the days in subsetted intervals?


Dear Stackoverflow Community,

I'm currently working with a dataset that contains start and end dates for individual travelers. However, I'm only interested in measuring the duration of time in days that the travelers spent at a location between a given set of days.

For example:

library(lubridate)
traveler1_startdate <- mdy("3/11/2020")
traveler1_enddate <- mdy("7/1/2023")

Instead of:

length_stay <- traveler1_endate - traveler1_startdate

I'd like to know the total number of days the traveler stayed in the location between these dates: 5/1/2020 - 9/1/2020, 5/1/2021 - 9/1/2021, 5/1/2022 - 9/1/2022, 5/1/2023 - 9/1/2023

I couldn't figure out how to do this with either lubridate or base R dates. I saw one Stackoverflow reply used between() but that didn't seem to work for me at least with lubridate intervals or dateDiff objects. I was also thinking of a lookup function but couldn't figure out how to use that with dates either.

Thanks!


Solution

  • Here is an example. Consider you have a data.frame with travellers and their start and end dates called dat. And a second data.frame with time periods you would like to consider in your counting (periods)

    library(lubridate)
    library(dplyr)
    
    dat <- data.frame(
      id = 'traveler1',
      startdate = mdy("3/11/2020"),
      enddate = mdy("7/1/2023")
    )
    
    periods <- data.frame(
      period_start = mdy("5/1/2020", "5/1/2021", "5/1/2022", "5/1/2023"),
      period_end = mdy("9/1/2020", "9/1/2021", "9/1/2022", "9/1/2023")
    )
    

    Now, we can join these tables together, looking for overlaps in the periods and travel dates. Then, using min and max logic we can easily count the number of days falling within each period. Finally we sum over the different periods to get a single count per traveller.

    left_join(
      dat,
      periods,
      join_by(overlaps(startdate, enddate, period_start, period_end))
    ) |> 
      mutate(day_count = pmin(enddate, period_end) - pmax(startdate, period_start)) |> 
      summarise(day_count = sum(day_count), .by = c(id, startdate, enddate))
    

    Result:

             id  startdate    enddate day_count
    1 traveler1 2020-03-11 2023-07-01  430 days