I have a large data frame (~30,000 rows) where I have two date fields "start_date" and "end_date".
I want to summarise the data such that I have 1 column with all the dates and a second column with a count of all the rows in which that date is between the "start_date" and "end_date".
I can make this work using 2 for loops but it is very inefficient as it is going by one though one comparing about 180 dates to 30,000 rows of date ranges.
Below is an example. Say I have the following dataframe.
df <- tibble(
start_date = c(1,1,2,2,3,3,4,4,5,5),
end_date = c(2,3,4,5,6,7,8,9,10,11)
)
I want this to output a table/dataframe that looks like this
Date Count
1 2
2 4
3 5
4 6
5 7
6 6
7 5
8 4
9 3
10 2
11 1
Is there some TidyVerse functions or anything else that could do this transformation efficiently?
Here's a base R method:
date = seq(min(df$start_date), max(df$end_date))
count = sapply(date, \(x) sum(x >= df$start_date & x <= df$end_date))
data.frame(date, count)
# date count
# 1 1 2
# 2 2 4
# 3 3 5
# 4 4 6
# 5 5 7
# 6 6 6
# 7 7 5
# 8 8 4
# 9 9 3
# 10 10 2
# 11 11 1