Search code examples
rtimestampintervalslubridateperiod

Extract the overlaps of time intervals for the calculation of availabilities


I have a dataset on malfunction times. The start of the malfunction is the Begin column and the end is the End column.
In addition there are 3 malfunction categories, which can also overlap in time.
My actual goal is to calculate the sum of the time period per day when there is no malfunction. (The category actually does not matter, I do not need a breakdown of the category).
For better illustration, here is an example data set of the malfunction times over 3 days:

> df_time
# A tibble: 8 × 3
  Category Begin               End                
  <chr>    <dttm>              <dttm>             
1 A        2023-07-15 01:40:11 2023-07-15 13:43:15
2 A        2023-07-16 05:54:44 2023-07-16 10:50:45
3 B        2023-08-16 07:43:09 2023-08-16 16:42:12
4 C        2023-08-16 12:00:00 2023-08-16 13:11:13
5 A        2023-08-16 18:00:00 2023-08-16 19:30:00
6 A        2023-08-17 08:00:00 2023-08-17 13:00:00
7 C        2023-08-17 11:12:45 2023-08-17 19:58:22
8 A        2023-08-17 19:01:45 2023-08-17 23:59:59

I have now tried to visualize the malfunction times graphically: enter image description here

Now I want to calculate a summary per calendar day, which shows how many minutes (or hours or seconds) no single malfunction occurred.
This is the sum of all green time intervals per day: enter image description here

If there would be only one malfunction category, one could simply calculate the 24 hours minus the malfunction interval.
However, I do not get further in dealing with the different overlapping malfunction intervals.
Can anyone help me in calculating the malfunction free periods and/or also the daily malfunction periods.

I have imagined such an evaluation to be somewhat: (Where the results are probably not exactly the same as the example data set above)

  Date       OK_times malfunction_times
  <date>     <chr>    <chr>
1 2023-07-15 17 hours 7 hours
2 2023-07-16 5 Hours  19 hours
3 2023-07-17 3 hours  21 hours

Here is the code for creating the example data frame:

library(tidyverse)

df_time <- tibble(
  Category = c("A", "A", "B", "C", "A", "A", "C", "A"),
  Begin = as_datetime(c(
    "2023-07-15 01:40:11",
    "2023-07-16 05:54:44",
    "2023-08-16 07:43:09",
    "2023-08-16 12:00:00",
    "2023-08-16 18:00:00",
    "2023-08-17 08:00:00",
    "2023-08-17 11:12:45",
    "2023-08-17 19:01:45"
    )),
  End = as_datetime(c(
    "2023-07-15 13:43:15",
    "2023-07-16 10:50:45",
    "2023-08-16 16:42:12",
    "2023-08-16 13:11:13",
    "2023-08-16 19:30:00",
    "2023-08-17 13:00:00",
    "2023-08-17 19:58:22",
    "2023-08-17 23:59:59"
    ))
)

Solution

  • interval_intersects <- function(i1, i2) { 
      # check if there's an overlap between the two intervals
      between(i1$Begin, i2$Begin, i2$End) | between(i1$End, i2$Begin, i2$End) | between(i2$Begin, i1$Begin, i1$End) | between(i2$End, i1$Begin, i1$End)
    }
    
    tidy_intervals <- function(df) {
      out <- df[0,] # empty tibble with the columns of the input df
      
      while (nrow(df) > 0) {
        matched = FALSE
        if (nrow(out) != 0) { # this is so janky but R will try to loop through the dataframe, even when it has no rows, even using seq_along
          
          # for each row in out, check if it intersects with the current row
          for (j in 1:nrow(out)) {
    
            # if it does, update the current output row to be the minimum of the two begin times and the maximum of the two end times
            if (interval_intersects(df[1, ], out[j, ])) {
              matched = TRUE
              out[j, ] <- tibble(
                Begin = min(df[1,]$Begin, out[j,]$Begin),
                End = max(df[1,]$End, out[j,]$End)
              )
              break
            }
          }
        }
        # if the current row didn't intersect with any of the output rows, append it to the output
        if (!matched) {
          out <- out |> add_row(
            Begin = df[1,]$Begin,
            End = df[1,]$End)
        }
    
        # remove the current row from the input
        df <- df[-1, ]
      }
      return(out)
    }
    
    tidy_intervals(df_time |> select(-Category)) |> 
      ##### everything from here until later is from the old answer https://stackoverflow.com/a/76905774/4145280 #####
      mutate(b = as.Date(Begin), e = as.Date(End),
            # create a sequence of dates between begin and end
            days = map2(b, e, ~ seq.Date(.x, .y, by = "1 day"))) |>
      # unnest the days column into many rows
      unnest(days) |>
      # if the beginning date is the same as the date in `days`, then use the original Begin column
      # else, use `days` as a datetime
      mutate(Begin = if_else(b == days, Begin, as_datetime(days)),
             # same with End, but subtracting one minute
             End = if_else(e == days, End, as_datetime(days) + days(1) - seconds(1)), .keep = "unused") |> 
     #### new stuff starts here ####
      mutate(Date = as.Date(Begin), 
             malfunction_times = End - Begin) |>
      reframe(malfunction_times = round(sum(malfunction_times)),
              OK_times = 24 - malfunction_times, .by = Date)
    

    Output:

    # A tibble: 4 × 3
      Date       malfunction_times OK_times
      <date>     <drtn>            <drtn>  
    1 2023-07-15 12 hours          12 hours
    2 2023-07-16  5 hours          19 hours
    3 2023-08-16 10 hours          14 hours
    4 2023-08-17 16 hours           8 hours