I am working with a dataframe where each row is indexed by a date. These are ordered but incomplete, meaning that that there is a subset of uninterrupted sequential rows in the dataset. I am interested in grouping the rows that belong to a sequence and finding out the length (number of days) of each sequence. Rows that are not part of a sequence are disregarded (or could have length of 1). Consider the following toy data:
library(tidyverse)
library(lubridate)
# toy data
df <- tibble(day = as_date(c("2022-01-01", "2022-01-03", "2022-01-04",
"2022-01-05", "2022-01-08","2022-01-10",
"2022-01-11", "2022-01-12","2022-01-13")),
values = 1:9)
What I am trying to obtain is an aggregated dataframe that tells me each uninterrupted sequence of days in the data and their length such as this:
# ideal output
tibble(sequences = c("2022-01-03 - 2022-01-05", "2022-01-10 - 2022-01-12"),
length = c(3, 4))
So far I have calculated the differences in days between each row by using lags and leads and then turned this into an additional column indicating whether a row is the start, mid or endpoint of an uninterrupted sequence of days:
# transform df
df %>%
# add lag and lead to find day differences
mutate(last_day = lag(day),
next_day = lead(day),
diff_last = day - last_day,
diff_next = next_day - day) %>%
# flag days in sequence
mutate(in_seq = if_else(
condition = diff_last == 1| diff_next == 1,
true = TRUE,
false = FALSE)) %>%
# find position of days in sequence
mutate(seq_position = case_when(
diff_last > 1 & diff_next == 1 ~ "start",
diff_last == 1 & diff_next > 1 ~ "end",
diff_last == 1 & diff_next == 1 ~ "mid",
is.na(diff_last) & diff_next == 1 ~ "start",
diff_last == 1 & is.na(diff_next) ~ "end",
is.na(diff_last) & is.na(diff_next) ~ NA_character_,
TRUE ~ "none"
))
This is where I'm stuck. I suspect that I need to create an additional variable for grouping that contains the start and end of each sequence so that I can summarise the number of rows per group, but I do not know a way for creating such groups since it involves creating a string based on values from multiple columns.
You could do:
df %>%
group_by(cumsum(c(0, diff(day) - 1))) %>%
summarise(sequences = paste(first(day), last(day), sep = ' - '),
length = n()) %>%
filter(length > 1) %>%
select(sequences, length)
#> # A tibble: 2 x 2
#> sequences length
#> <chr> <int>
#> 1 2022-01-03 - 2022-01-05 3
#> 2 2022-01-10 - 2022-01-13 4