Search code examples
rdplyrdata-manipulation

Finding the number of consecutive days in data


I am working with a dataframe where each row is indexed by a date. These are ordered but incomplete, meaning that that there is a subset of uninterrupted sequential rows in the dataset. I am interested in grouping the rows that belong to a sequence and finding out the length (number of days) of each sequence. Rows that are not part of a sequence are disregarded (or could have length of 1). Consider the following toy data:

library(tidyverse)
library(lubridate)

# toy data
df <- tibble(day = as_date(c("2022-01-01", "2022-01-03", "2022-01-04", 
                             "2022-01-05", "2022-01-08","2022-01-10",
                             "2022-01-11", "2022-01-12","2022-01-13")),
             values = 1:9)

What I am trying to obtain is an aggregated dataframe that tells me each uninterrupted sequence of days in the data and their length such as this:

# ideal output
tibble(sequences = c("2022-01-03 - 2022-01-05", "2022-01-10 - 2022-01-12"),
       length = c(3, 4))

So far I have calculated the differences in days between each row by using lags and leads and then turned this into an additional column indicating whether a row is the start, mid or endpoint of an uninterrupted sequence of days:

# transform df
df %>% 
  # add lag and lead to find day differences
  mutate(last_day = lag(day), 
         next_day = lead(day),
         diff_last = day - last_day,
         diff_next = next_day - day) %>% 
  # flag days in sequence
  mutate(in_seq = if_else(
    condition = diff_last == 1| diff_next == 1, 
    true = TRUE, 
    false = FALSE)) %>% 
  # find position of days in sequence
  mutate(seq_position = case_when(
    diff_last > 1 & diff_next == 1 ~ "start",
    diff_last == 1 & diff_next > 1 ~ "end",
    diff_last == 1 & diff_next == 1 ~ "mid",
    is.na(diff_last) & diff_next == 1 ~ "start",
    diff_last == 1 & is.na(diff_next) ~ "end",
    is.na(diff_last) & is.na(diff_next) ~ NA_character_,
    TRUE ~ "none"
  ))

This is where I'm stuck. I suspect that I need to create an additional variable for grouping that contains the start and end of each sequence so that I can summarise the number of rows per group, but I do not know a way for creating such groups since it involves creating a string based on values from multiple columns.


Solution

  • You could do:

    df %>% 
      group_by(cumsum(c(0, diff(day) - 1))) %>%
      summarise(sequences = paste(first(day), last(day), sep = ' - '),
                length    = n()) %>%
      filter(length > 1) %>%
      select(sequences, length)
    
    #> # A tibble: 2 x 2
    #>   sequences               length
    #>   <chr>                    <int>
    #> 1 2022-01-03 - 2022-01-05      3
    #> 2 2022-01-10 - 2022-01-13      4