I have this dataset in R:
library(ggplot2)
library(dplyr)
library(lubridate)
set.seed(123)
mydf <- data.frame(
name = sample(LETTERS, 10, replace = TRUE),
start_date = as.Date("2022-01-01") + sample(0:500, 10, replace = TRUE),
end_date = as.Date("2022-01-01") + sample(200:700, 10, replace = TRUE)
)
It looks like this:
> head(mydf)
name start_date end_date
1 O 2022-01-14 2023-07-02
2 S 2023-01-09 2022-12-03
3 N 2022-06-02 2023-11-08
4 C 2022-03-31 2023-07-09
5 J 2022-04-01 2023-06-12
6 R 2022-09-13 2022-08-14
I want to find out for each year-month between the min start_date and max end_date, how many rows overlap that date. For example, between 2020-Jan and 2020-Feb, how many rows have start_date and end_date between 2020-Jan and 2020-Feb? Between 2020-Feb and 2020-March, how many rows have start_date and end_date between 2020-Feb and 2020-March?
I tried to do this manually:
min_date <- min(mydf$start_date)
max_date <- max(mydf$end_date)
month_seq <- seq(from = floor_date(min_date, "month"),
to = floor_date(max_date, "month"),
by = "month")
valid_counts <- data.frame(month = month_seq) %>%
rowwise() %>%
mutate(valid_count = sum(mydf$start_date <= month & mydf$end_date >= month)) %>%
ungroup()
ggplot(valid_counts, aes(x = month, y = valid_count)) +
geom_line(color = "blue", size = 1) +
geom_point(color = "red", size = 2) +
labs(title = "Number of Valid Rows Over Time",
x = "Month",
y = "Valid Count") +
theme_minimal()
Is there a more direct way to do this in R using predefined functions?
Here an option using data.table::between
.
> v <- do.call('seq', c(Map(\(f, x) as.Date(f(paste0(substr(x, 1, 8), '01'))),
+ c(min, max), mydf[c('start_date', 'end_date')]), 'month'))
> res <- data.frame(
+ month=v,
+ valid=colSums(
+ mapply(data.table::between, v, mydf['start_date'], mydf['end_date'])
+ )
+ )
> plot(res, type='b')