Search code examples
rdatetimetidyverseposix

In R, how do I create a histogram of time intervals?


I have a table of start and stop times for jobs, that I would like to plot. The input table looks like this:

job_started_at job_finished_at position
2024-04-29 18:35:09 2024-04-29 19:36:25 1
2024-04-29 18:33:19 2024-04-29 20:34:40 2
2024-04-29 19:37:17 2024-04-29 19:44:51 1
2024-04-29 21:12:35 2024-04-29 21:31:31 2
2024-04-29 20:31:52 2024-04-29 20:53:23 1

with multiple entries per day. There are 2 machines that handle jobs (indicated by position 1 and 2), thus the time intervals can not overlap for a given position. I want to show in a histogram the workload of each machine per hour for each day in the week (Mon, Tue, Wed, Thu, Fri, Sat, Sun). I was thinking about 7 different plots (1 per day) in which each hour gets one bar of the histogram. So if a job starts at 12:30 and finishes at 13:15 it should count 50% towards the 12h bar and 25% towards the 13h bar for example. Since the time intervals can overlap the hour mark, I don't know how to handle this problem.

I was thinking about the following workflow:

  • add a column that indicates the weekday of the job (for grouping later)
  • split all jobs on the hour mark (so 12:30 - 13:15 turns into 2 rows -> 12:30-13:00 and 13:00-13:15), this also has to work for multiple splits in one job
  • calculate usage for each job (percentage of workload (for example 12:30-13:00 -> 50%)
  • summarize by weekday and hour and take the mean of all workloads

This also has the problem that the machine might not work at all for multiple hours, but it should still count as 0% for the mean calculation above.

Do you have better ideas or can help me implement my idea?

Thanks a lot!


Solution

  • Here is an approach using the tidyverse (which includes lubridate). The output is a dataframe that you can then use as the basis for your chart.

    #set initial data as per question
    df <- structure(list(job_started_at = c("2024-04-29 18:35:09", "2024-04-29 18:33:19", 
                                            "2024-04-29 19:37:17", "2024-04-29 21:12:35", 
                                            "2024-04-29 20:31:52"
    ), job_finished_at = c("2024-04-29 19:36:25", "2024-04-29 20:34:40", 
                           "2024-04-29 19:44:51", "2024-04-29 21:31:31", 
                           "2024-04-29 20:53:23"
    ), position = c(1L, 2L, 1L, 2L, 1L)), class = "data.frame", row.names = c(NA, -5L))
    
    df <- df %>% 
      mutate(across(starts_with("job"), ymd_hms),                  #convert to datetimes
             job_int = interval(job_started_at, job_finished_at))  #calculate intervals
    
    #create a df of the hours to be analysed
    hours_df <- tibble(hour = seq(from = ymd_hms("2024-04-29 17:00:00"),
                                  to = ymd_hms("2024-04-29 22:00:00"),
                                  by = "hour")) %>% 
      mutate(hour_int = interval(hour, hour + hours(1)))           #intervals
             
    #for each item, calculate a vector of overlaps, unnest, group and summarise
    overlaps_df <- df %>% 
      mutate(overlap = map(job_int, ~seconds(intersect(hours_df$hour_int, .))/3600),
             hour_start = list(hours_df$hour)) %>% 
      unnest_longer(c(overlap, hour_start)) %>% 
      group_by(position, hour_start) %>% 
      summarise(utilisation = sum(overlap, na.rm = TRUE))  
    
    overlaps_df
    
    # A tibble: 12 × 3
    # Groups:   position [2]
       position hour_start          utilisation
          <int> <dttm>                    <dbl>
     1        1 2024-04-29 17:00:00       0    
     2        1 2024-04-29 18:00:00       0.414
     3        1 2024-04-29 19:00:00       0.733
     4        1 2024-04-29 20:00:00       0.359
     5        1 2024-04-29 21:00:00       0    
     6        1 2024-04-29 22:00:00       0    
     7        2 2024-04-29 17:00:00       0    
     8        2 2024-04-29 18:00:00       0.445
     9        2 2024-04-29 19:00:00       1    
    10        2 2024-04-29 20:00:00       0.578
    11        2 2024-04-29 21:00:00       0.316
    12        2 2024-04-29 22:00:00       0