Search code examples
rdplyrlubridate

Is there a way to combine dates of an event covering weekend or bank holidays?


I have a dataframe that looks a bit like this:

Event ID Start Date End Data
1 06/02/2024 09/02/2024
1 22/05/2024 24/05/2024
1 28/05/2024 06/06/2024
2 27/06/2024 27/06/2024
2 28/06/2024 28/06/2024

I would like to combine the dates which go over the weekend to combine into one event in the dataframe, while retaining the other entries:

Event ID Start Date End Data
1 06/02/2024 09/02/2024
1 22/05/2024 06/06/2024
2 27/06/2024 28/06/2024

For event id 1 - I want to condense the gap made by the long weekend which lasted from 25/05 to 27/05, while retaining the event id that was in Feb. For event id 2 - I want to condense the sequential dates into one.

This is the chunk of data I tried:

event_data <- data.frame(
  `Event ID` = c(1, 1, 1, 2, 2),
  `Start Date` = c("06/02/2024", "22/05/2024", "28/05/2024", "27/06/2024", "28/06/2024"),
  `End Date` = c("09/02/2024", "24/05/2024", "06/06/2024", "27/06/2024", "28/06/2024")
)

consolidated_data <- event_data %>%
  arrange(Event_ID, Start_Date) %>% 
  group_by(Event_ID) %>%
  mutate(
    # Identify if the current Start_Date is consecutive or overlapping with the previous End_Date
    previous_end_date = lag(End_Date),
    is_consecutive = ifelse(!is.na(previous_end_date) & (Start_Date <= previous_end_date + 1), TRUE, FALSE),
    date_group = cumsum(!is_consecutive)
  ) %>%
  group_by(Event_ID, date_group ) %>%
  summarise(
    Start_Date = min(Start_Date),
    End_Date = max(End_Date),
    .groups = 'drop'
  ) %>%
  ungroup()

print(consolidated_data)

This applied a group called event_id, but didn't group together what I wanted. If anyone can help, I would greatly appreciate it!


Solution

  • Using the data in the Note at the end define a function incr_group which takes the dates defining an interval (the interval from the end of the last row to the beginning of the current row) and computes a logical value which is whether or not the current row should start a new group. Invoke it on each row and then use cumsum on its results within Event Id to get the grouping variable group. Finally summarize the rows in each group down to a single row.

    library(dplyr)
    
    # Args should be dates such that x <= y. x may be NA.
    # Return TRUE if row should start a new group else FALSE
    incr_group <- function(x, y) {
      diff <- as.numeric(y - x)
      consec <- diff == 1
      has_sun <- "Sunday" %in% weekdays(seq(coalesce(x, y), y, by = "day")) 
      is.na(x) || !(consec || (diff <= 4 && has_sun))
    }
    
    event_data %>%
      mutate(`Start Date` = as.Date(`Start Date`, "%d/%m/%Y"),
        `End Date` = as.Date(`End Date`, "%d/%m/%Y"),
        prevEnd = lag(`End Date`), 
        group = Vectorize(incr_group)(prevEnd, `Start Date`) %>% cumsum,
        .by = `Event ID`) %>%
      summarize(`Start Date` = first(`Start Date`), `End Date` = last(`End Date`),
        .by = c(`Event ID`, group)) %>%
      select(-group)
    

    giving

    # A tibble: 3 × 3
      `Event ID` `Start Date` `End Date`
           <dbl> <date>       <date>    
    1          1 2024-02-06   2024-02-09
    2          1 2024-05-22   2024-06-06
    3          2 2024-06-27   2024-06-28
    

    Old

    This works too but the new version above avoids using Reduce.

    library(dplyr)
    library(purrr)
    
    make_groups <- function(group, i, start, end) {
      diff <- as.numeric(start[i] - end[i-1])
      consec <- diff == 1
      weekend <- diff <= 4 && 
        any(weekdays(seq(start[i-1], end[i], by = "day")) %in% 
          c("Saturday", "Sunday"))
      group + !(consec || weekend)
    }
    
    event_data %>%
       mutate(`Start Date` = as.Date(`Start Date`, "%d/%m/%Y"),
              `End Date` = as.Date(`End Date`, "%d/%m/%Y")) %>%
       mutate(group = if (n() == 1) 1 else Reduce(
         f = partial(make_groups, start = `Start Date`, end = `End Date`), 
         x = 2:n(), 
         init = 1,
         acc = TRUE), .by = `Event ID`) %>%
       summarize(`Start Date` = first(`Start Date`), `End Date` = last(`End Date`),
         .by = c(`Event ID`, group)) %>%
       select(-group)
    

    Note

    We used the following data adding check.names = FALSE

    event_data <- data.frame(
      `Event ID` = c(1, 1, 1, 2, 2),
      `Start Date` = c("06/02/2024", "22/05/2024", "28/05/2024", "27/06/2024", "28/06/2024"),
      `End Date` = c("09/02/2024", "24/05/2024", "06/06/2024", "27/06/2024", "28/06/2024"),
       check.names = FALSE)