Search code examples
r

Counting the number of rows between each pair of months?


I have this dataset in R:

library(ggplot2)
library(dplyr)
library(lubridate)

set.seed(123)
mydf <- data.frame(
  name = sample(LETTERS, 10, replace = TRUE),
  start_date = as.Date("2022-01-01") + sample(0:500, 10, replace = TRUE),
  end_date = as.Date("2022-01-01") + sample(200:700, 10, replace = TRUE)
)

It looks like this:

> head(mydf)
  name start_date   end_date
1    O 2022-01-14 2023-07-02
2    S 2023-01-09 2022-12-03
3    N 2022-06-02 2023-11-08
4    C 2022-03-31 2023-07-09
5    J 2022-04-01 2023-06-12
6    R 2022-09-13 2022-08-14

I want to find out for each year-month between the min start_date and max end_date, how many rows overlap that date. For example, between 2020-Jan and 2020-Feb, how many rows have start_date and end_date between 2020-Jan and 2020-Feb? Between 2020-Feb and 2020-March, how many rows have start_date and end_date between 2020-Feb and 2020-March?

I tried to do this manually:

min_date <- min(mydf$start_date)
max_date <- max(mydf$end_date)

month_seq <- seq(from = floor_date(min_date, "month"), 
                 to = floor_date(max_date, "month"), 
                 by = "month")

valid_counts <- data.frame(month = month_seq) %>%
  rowwise() %>%
  mutate(valid_count = sum(mydf$start_date <= month & mydf$end_date >= month)) %>%
  ungroup()

ggplot(valid_counts, aes(x = month, y = valid_count)) +
  geom_line(color = "blue", size = 1) +
  geom_point(color = "red", size = 2) +
  labs(title = "Number of Valid Rows Over Time",
       x = "Month",
       y = "Valid Count") +
  theme_minimal()

Is there a more direct way to do this in R using predefined functions?


Solution

  • Here an option using data.table::between.

    > v <- do.call('seq', c(Map(\(f, x) as.Date(f(paste0(substr(x, 1, 8), '01'))), 
    +                           c(min, max), mydf[c('start_date', 'end_date')]), 'month'))
    > res <- data.frame(
    +   month=v,
    +   valid=colSums(
    +     mapply(data.table::between, v, mydf['start_date'], mydf['end_date'])
    +   )
    + )
    > plot(res, type='b')
    

    enter image description here