Search code examples
rintervalslubridate

Separate intersecting time intervals by number of segments (R)


I'm working on a dataset of time intervals. Some intervals overlap. I would like to take the raw interval data, and break it into consecutive intervals, by number of overlaps. In the toy data below, there are 3 intervals. My desired output is a data frame that contains the start and stop of where there's only one ID, then start and stop where ID 1 and ID 2 intersect, then start and stop of where IDs 1-3 intersect, then start and stop of where IDs 1 and 3 intersect, and finally the start and stop of the remainder of ID 1.

library(lubridate)
library(ggplot2)

df <- structure(list(ID = 1:3, Start = structure(c(1690740180, 1690740480, 
1690741380), class = c("POSIXct", "POSIXt"), tzone = "America/Iqaluit"), 
    End = structure(c(1690751520, 1690742140, 1690742280), class = c("POSIXct", 
    "POSIXt"), tzone = "America/Iqaluit")), row.names = 3:5, class = "data.frame")

ggplot(df) + geom_segment(aes(x = Start, xend = End, y = as.factor(ID), yend = as.factor(ID)))

Desired output should look like this:

  Intervals               Start                 End
         1 2023-07-30 14:03:00 2023-07-30 14:07:59
         2 2023-07-30 14:08:00 2023-07-30 14:22:59
         3 2023-07-30 14:23:00 2023-07-30 14:35:40
         2 2023-07-30 14:35:40 2023-07-30 14:38:00
         1 2023-07-30 14:38:00 2023-07-30 15:06:40

I can do this by interpolating the data down to 1 sec and checking for intersections, but I was hoping for a cleaner solution.


Solution

  • Here's a base R solution:

    alltimes <- unique(sort(c(df$Start, df$End)))
    intervals <- sapply(alltimes[-length(alltimes)],
                        function(tm) df$Start <= tm & tm < df$End)
    intervals
    #       [,1]  [,2] [,3]  [,4]  [,5]
    # [1,]  TRUE  TRUE TRUE  TRUE  TRUE
    # [2,] FALSE  TRUE TRUE FALSE FALSE
    # [3,] FALSE FALSE TRUE  TRUE FALSE
    

    In intervals, each row is a row from the original df, each column is a time segment, and the value indicates if the original df row was found in that time segment. We can take the sums of each column to create the Intervals column, and then the Start and End columns are merely the pairs of our alltimes vector.

    data.frame(
      Intervals = colSums(intervals),
      Start = alltimes[-length(alltimes)],
      End = alltimes[-1]
    )
    #   Intervals               Start                 End
    # 1         1 2023-07-30 14:03:00 2023-07-30 14:08:00
    # 2         2 2023-07-30 14:08:00 2023-07-30 14:23:00
    # 3         3 2023-07-30 14:23:00 2023-07-30 14:35:40
    # 4         2 2023-07-30 14:35:40 2023-07-30 14:38:00
    # 5         1 2023-07-30 14:38:00 2023-07-30 17:12:00
    

    I'm not certain if the new End should be the same as the next Start or offset by one second, your expected output uses both. Also, I don't know how your last row is 15:06:40 (is not in your raw data), I suspect it's an artifact of your real data and not the sample.