I'm working on a dataset of time intervals. Some intervals overlap. I would like to take the raw interval data, and break it into consecutive intervals, by number of overlaps. In the toy data below, there are 3 intervals. My desired output is a data frame that contains the start and stop of where there's only one ID, then start and stop where ID 1 and ID 2 intersect, then start and stop of where IDs 1-3 intersect, then start and stop of where IDs 1 and 3 intersect, and finally the start and stop of the remainder of ID 1.
library(lubridate)
library(ggplot2)
df <- structure(list(ID = 1:3, Start = structure(c(1690740180, 1690740480,
1690741380), class = c("POSIXct", "POSIXt"), tzone = "America/Iqaluit"),
End = structure(c(1690751520, 1690742140, 1690742280), class = c("POSIXct",
"POSIXt"), tzone = "America/Iqaluit")), row.names = 3:5, class = "data.frame")
ggplot(df) + geom_segment(aes(x = Start, xend = End, y = as.factor(ID), yend = as.factor(ID)))
Desired output should look like this:
Intervals Start End
1 2023-07-30 14:03:00 2023-07-30 14:07:59
2 2023-07-30 14:08:00 2023-07-30 14:22:59
3 2023-07-30 14:23:00 2023-07-30 14:35:40
2 2023-07-30 14:35:40 2023-07-30 14:38:00
1 2023-07-30 14:38:00 2023-07-30 15:06:40
I can do this by interpolating the data down to 1 sec and checking for intersections, but I was hoping for a cleaner solution.
Here's a base R solution:
alltimes <- unique(sort(c(df$Start, df$End)))
intervals <- sapply(alltimes[-length(alltimes)],
function(tm) df$Start <= tm & tm < df$End)
intervals
# [,1] [,2] [,3] [,4] [,5]
# [1,] TRUE TRUE TRUE TRUE TRUE
# [2,] FALSE TRUE TRUE FALSE FALSE
# [3,] FALSE FALSE TRUE TRUE FALSE
In intervals
, each row is a row from the original df
, each column is a time segment, and the value indicates if the original df
row was found in that time segment. We can take the sums of each column to create the Intervals
column, and then the Start
and End
columns are merely the pairs of our alltimes
vector.
data.frame(
Intervals = colSums(intervals),
Start = alltimes[-length(alltimes)],
End = alltimes[-1]
)
# Intervals Start End
# 1 1 2023-07-30 14:03:00 2023-07-30 14:08:00
# 2 2 2023-07-30 14:08:00 2023-07-30 14:23:00
# 3 3 2023-07-30 14:23:00 2023-07-30 14:35:40
# 4 2 2023-07-30 14:35:40 2023-07-30 14:38:00
# 5 1 2023-07-30 14:38:00 2023-07-30 17:12:00
I'm not certain if the new End
should be the same as the next Start
or offset by one second, your expected output uses both. Also, I don't know how your last row is 15:06:40
(is not in your raw data), I suspect it's an artifact of your real data and not the sample.