I have a dataframe that looks a bit like this:
Event ID | Start Date | End Data |
---|---|---|
1 | 06/02/2024 | 09/02/2024 |
1 | 22/05/2024 | 24/05/2024 |
1 | 28/05/2024 | 06/06/2024 |
2 | 27/06/2024 | 27/06/2024 |
2 | 28/06/2024 | 28/06/2024 |
I would like to combine the dates which go over the weekend to combine into one event in the dataframe, while retaining the other entries:
Event ID | Start Date | End Data |
---|---|---|
1 | 06/02/2024 | 09/02/2024 |
1 | 22/05/2024 | 06/06/2024 |
2 | 27/06/2024 | 28/06/2024 |
For event id 1 - I want to condense the gap made by the long weekend which lasted from 25/05 to 27/05, while retaining the event id that was in Feb. For event id 2 - I want to condense the sequential dates into one.
This is the chunk of data I tried:
event_data <- data.frame(
`Event ID` = c(1, 1, 1, 2, 2),
`Start Date` = c("06/02/2024", "22/05/2024", "28/05/2024", "27/06/2024", "28/06/2024"),
`End Date` = c("09/02/2024", "24/05/2024", "06/06/2024", "27/06/2024", "28/06/2024")
)
consolidated_data <- event_data %>%
arrange(Event_ID, Start_Date) %>%
group_by(Event_ID) %>%
mutate(
# Identify if the current Start_Date is consecutive or overlapping with the previous End_Date
previous_end_date = lag(End_Date),
is_consecutive = ifelse(!is.na(previous_end_date) & (Start_Date <= previous_end_date + 1), TRUE, FALSE),
date_group = cumsum(!is_consecutive)
) %>%
group_by(Event_ID, date_group ) %>%
summarise(
Start_Date = min(Start_Date),
End_Date = max(End_Date),
.groups = 'drop'
) %>%
ungroup()
print(consolidated_data)
This applied a group called event_id, but didn't group together what I wanted. If anyone can help, I would greatly appreciate it!
Using the data in the Note at the end define a function incr_group
which takes the dates defining an interval (the interval from the end of the last row to the beginning of the current row) and computes a logical value which is whether or not the current row should start a new group. Invoke it on each row and then use cumsum
on its results within Event Id
to get the grouping variable group
. Finally summarize the rows in each group down to a single row.
library(dplyr)
# Args should be dates such that x <= y. x may be NA.
# Return TRUE if row should start a new group else FALSE
incr_group <- function(x, y) {
diff <- as.numeric(y - x)
consec <- diff == 1
has_sun <- "Sunday" %in% weekdays(seq(coalesce(x, y), y, by = "day"))
is.na(x) || !(consec || (diff <= 4 && has_sun))
}
event_data %>%
mutate(`Start Date` = as.Date(`Start Date`, "%d/%m/%Y"),
`End Date` = as.Date(`End Date`, "%d/%m/%Y"),
prevEnd = lag(`End Date`),
group = Vectorize(incr_group)(prevEnd, `Start Date`) %>% cumsum,
.by = `Event ID`) %>%
summarize(`Start Date` = first(`Start Date`), `End Date` = last(`End Date`),
.by = c(`Event ID`, group)) %>%
select(-group)
giving
# A tibble: 3 × 3
`Event ID` `Start Date` `End Date`
<dbl> <date> <date>
1 1 2024-02-06 2024-02-09
2 1 2024-05-22 2024-06-06
3 2 2024-06-27 2024-06-28
This works too but the new version above avoids using Reduce
.
library(dplyr)
library(purrr)
make_groups <- function(group, i, start, end) {
diff <- as.numeric(start[i] - end[i-1])
consec <- diff == 1
weekend <- diff <= 4 &&
any(weekdays(seq(start[i-1], end[i], by = "day")) %in%
c("Saturday", "Sunday"))
group + !(consec || weekend)
}
event_data %>%
mutate(`Start Date` = as.Date(`Start Date`, "%d/%m/%Y"),
`End Date` = as.Date(`End Date`, "%d/%m/%Y")) %>%
mutate(group = if (n() == 1) 1 else Reduce(
f = partial(make_groups, start = `Start Date`, end = `End Date`),
x = 2:n(),
init = 1,
acc = TRUE), .by = `Event ID`) %>%
summarize(`Start Date` = first(`Start Date`), `End Date` = last(`End Date`),
.by = c(`Event ID`, group)) %>%
select(-group)
We used the following data adding check.names = FALSE
event_data <- data.frame(
`Event ID` = c(1, 1, 1, 2, 2),
`Start Date` = c("06/02/2024", "22/05/2024", "28/05/2024", "27/06/2024", "28/06/2024"),
`End Date` = c("09/02/2024", "24/05/2024", "06/06/2024", "27/06/2024", "28/06/2024"),
check.names = FALSE)