I have data recording people (personid) recieving different types of services (streamid) from a provider. Each period of service has a start and end date, the latter of which may be missing if their service is ongoing.
I need to group their rows where they have periods of ongoing service allowing for a day of gap between the end of the previous service and the start of the next. The data below has the target hand coded into the table.
library(tidyverse)
test <- type_convert(tribble(
~personid, ~streamid, ~datetimestart, ~datetimeend, ~targetgroup,
1, 1, "2023-01-01", "2023-01-05", 1,
1, 2, "2023-01-07", "2023-01-30", 2,
2, 2, "2023-12-01", NA_character_, 1,
2, 1, "2024-01-12", "2024-01-30", 1,
2, 3, "2024-02-10", "2024-02-28", 1,
2, 1, "2024-02-25", NA_character_, 1,
3, 3, "2023-12-01", "2024-01-14", 1,
3, 2, "2024-01-12", "2024-01-30", 1,
3, 1, "2024-01-10", "2024-02-01", 1,
4, 3, "2023-12-01", "2024-01-14", 1,
4, 2, "2024-01-12", "2024-01-20", 1,
4, 1, "2024-01-21", NA_character_, 1
))
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> datetimestart = col_date(format = ""),
#> datetimeend = col_date(format = "")
#> )
Created on 2024-01-23 with reprex v2.1.0
I have been able to get 90% of where I want to using lags or coalesce, but situations like the personid == 2 group, where the overlap is between non-sequential rows (ordered by start date). Here's an example of my "best" outcomes so far.
test %>%
arrange(personid, datetimestart) %>%
group_by(personid) %>%
mutate(new_episode_group = datetimestart - lag(datetimeend) > days(1),
new_episode_group = if_else(is.na(new_episode_group), FALSE, new_episode_group),
group = cumsum(new_episode_group) + 1) %>%
select(-new_episode_group)
#> # A tibble: 12 × 6
#> # Groups: personid [4]
#> personid streamid datetimestart datetimeend targetgroup group
#> <dbl> <dbl> <date> <date> <dbl> <dbl>
#> 1 1 1 2023-01-01 2023-01-05 1 1
#> 2 1 2 2023-01-07 2023-01-30 2 2
#> 3 2 2 2023-12-01 NA 1 1
#> 4 2 1 2024-01-12 2024-01-30 1 1
#> 5 2 3 2024-02-10 2024-02-28 1 2
#> 6 2 1 2024-02-25 NA 1 2
#> 7 3 3 2023-12-01 2024-01-14 1 1
#> 8 3 1 2024-01-10 2024-02-01 1 1
#> 9 3 2 2024-01-12 2024-01-30 1 1
#> 10 4 3 2023-12-01 2024-01-14 1 1
#> 11 4 2 2024-01-12 2024-01-20 1 1
#> 12 4 1 2024-01-21 NA 1 1
Created on 2024-01-23 with reprex v2.1.0
As far as I see, cummax
is what you want.
(NOTE: You need to convert date value into numeric because cummax
can't treat date class).
test %>%
replace_na(list(datetimeend = as.Date("2099-12-31"))) %>% # filling with large value
mutate(datetimeend_num = as.numeric(datetimeend)) %>% # because cummax can't treat Date
arrange(personid, datetimestart) %>%
group_by(personid) %>%
mutate(cummax_end = as.Date(cummax(datetimeend_num)),
new_episode_group = datetimestart - lag(cummax_end, default = cummax_end[1]) > days(1),
group = cumsum(new_episode_group) + 1) %>%
ungroup() %>%
select(-c(datetimeend_num, cummax_end, new_episode_group))