Search code examples
rdatetimerangegrouping

How to group rows when time ranges overlap (with leeway) in R?


I have data recording people (personid) recieving different types of services (streamid) from a provider. Each period of service has a start and end date, the latter of which may be missing if their service is ongoing.

I need to group their rows where they have periods of ongoing service allowing for a day of gap between the end of the previous service and the start of the next. The data below has the target hand coded into the table.

library(tidyverse)
test <-  type_convert(tribble(
  ~personid, ~streamid, ~datetimestart, ~datetimeend, ~targetgroup,
          1,         1,   "2023-01-01", "2023-01-05",            1,
          1,         2,   "2023-01-07", "2023-01-30",            2,
                    
          2,         2,   "2023-12-01", NA_character_,           1, 
          2,         1,   "2024-01-12", "2024-01-30",            1,
          2,         3,   "2024-02-10", "2024-02-28",            1,
          2,         1,   "2024-02-25", NA_character_,           1,
                    
          3,         3,   "2023-12-01", "2024-01-14",            1, 
          3,         2,   "2024-01-12", "2024-01-30",            1,
          3,         1,   "2024-01-10", "2024-02-01",            1,
                    
          4,         3,   "2023-12-01", "2024-01-14",            1, 
          4,         2,   "2024-01-12", "2024-01-20",            1,
          4,         1,   "2024-01-21", NA_character_,           1
  
))
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   datetimestart = col_date(format = ""),
#>   datetimeend = col_date(format = "")
#> )

Created on 2024-01-23 with reprex v2.1.0

I have been able to get 90% of where I want to using lags or coalesce, but situations like the personid == 2 group, where the overlap is between non-sequential rows (ordered by start date). Here's an example of my "best" outcomes so far.

test %>% 
  arrange(personid, datetimestart) %>%
  group_by(personid) %>%
  mutate(new_episode_group = datetimestart - lag(datetimeend) > days(1),
         new_episode_group = if_else(is.na(new_episode_group), FALSE, new_episode_group),
         group = cumsum(new_episode_group) + 1) %>% 
  select(-new_episode_group)
#> # A tibble: 12 × 6
#> # Groups:   personid [4]
#>    personid streamid datetimestart datetimeend targetgroup group
#>       <dbl>    <dbl> <date>        <date>            <dbl> <dbl>
#>  1        1        1 2023-01-01    2023-01-05            1     1
#>  2        1        2 2023-01-07    2023-01-30            2     2
#>  3        2        2 2023-12-01    NA                    1     1
#>  4        2        1 2024-01-12    2024-01-30            1     1
#>  5        2        3 2024-02-10    2024-02-28            1     2
#>  6        2        1 2024-02-25    NA                    1     2
#>  7        3        3 2023-12-01    2024-01-14            1     1
#>  8        3        1 2024-01-10    2024-02-01            1     1
#>  9        3        2 2024-01-12    2024-01-30            1     1
#> 10        4        3 2023-12-01    2024-01-14            1     1
#> 11        4        2 2024-01-12    2024-01-20            1     1
#> 12        4        1 2024-01-21    NA                    1     1

Created on 2024-01-23 with reprex v2.1.0


Solution

  • As far as I see, cummax is what you want.
    (NOTE: You need to convert date value into numeric because cummax can't treat date class).

    test %>% 
      replace_na(list(datetimeend = as.Date("2099-12-31"))) %>%  # filling with large value
      mutate(datetimeend_num = as.numeric(datetimeend)) %>%   # because cummax can't treat Date
      arrange(personid, datetimestart) %>% 
      group_by(personid) %>% 
      mutate(cummax_end = as.Date(cummax(datetimeend_num)), 
             new_episode_group = datetimestart - lag(cummax_end, default = cummax_end[1]) > days(1),
             group = cumsum(new_episode_group) + 1) %>% 
      ungroup() %>% 
      select(-c(datetimeend_num, cummax_end, new_episode_group))