Search code examples
rdplyrlubridatebizdays

find average incidents per business day


I've a dataset as under:

+----+-------+---------------------+
| ID | SUBID |        date         |
+----+-------+---------------------+
| A  |     1 | 2021-01-01 12:00:00 |
| A  |     1 | 2021-01-02 01:00:00 |
| A  |     1 | 2021-01-02 02:00:00 |
| A  |     1 | 2021-01-03 03:00:00 |
| A  |     2 | 2021-01-05 16:00:00 |
| A  |     2 | 2021-01-06 13:00:00 |
| A  |     2 | 2021-01-07 06:00:00 |
| A  |     2 | 2021-01-08 08:00:00 |
| A  |     2 | 2021-01-08 10:00:00 |
| A  |     2 | 2021-01-08 11:00:00 |
| A  |     3 | 2021-01-09 09:00:00 |
| A  |     3 | 2021-01-10 19:00:00 |
| A  |     3 | 2021-01-11 20:00:00 |
| A  |     3 | 2021-01-12 22:00:00 |
| B  |     1 | 2021-02-01 23:00:00 |
| B  |     1 | 2021-02-02 15:00:00 |
| B  |     1 | 2021-02-03 06:00:00 |
| B  |     1 | 2021-02-04 08:00:00 |
| B  |     2 | 2021-02-05 18:00:00 |
| B  |     2 | 2021-02-05 19:00:00 |
| B  |     2 | 2021-02-06 22:00:00 |
| B  |     2 | 2021-02-06 23:00:00 |
| B  |     2 | 2021-02-07 04:00:00 |
| B  |     2 | 2021-02-08 02:00:00 |
| B  |     3 | 2021-02-09 01:00:00 |
| B  |     3 | 2021-02-10 03:00:00 |
| B  |     3 | 2021-02-11 13:00:00 |
| B  |     3 | 2021-02-12 14:00:00 |
+----+-------+---------------------+

I want to be able to get the time difference between each ID and SUBID group in hours, preferably in terms of business hours, where each of the date that appears on a weekend or a federal holiday can be moved to a nearest weekday (preceding or succeeding) with a time of 23:59:59 as under:

+----+-------+---------------------+------------------------------------------------------------------+
| ID | SUBID |        date         | timediff (hours) with preceding date for each group (ID, SUBID) |
+----+-------+---------------------+------------------------------------------------------------------+
| A  |     1 | 2021-01-01 12:00:00 |                                                                0 |
| A  |     1 | 2021-01-02 01:00:00 |                                                               13 |
| A  |     1 | 2021-01-02 02:00:00 |                                                                1 |
| A  |     1 | 2021-01-03 03:00:00 |                                                                1 |
| A  |     2 | 2021-01-05 16:00:00 |                                                                0 |
| A  |     2 | 2021-01-06 13:00:00 |                                                               21 |
| A  |     2 | 2021-01-07 06:00:00 |                                                               17 |
| A  |     2 | 2021-01-08 08:00:00 |                                                                2 |
| A  |     2 | 2021-01-08 10:00:00 |                                                                2 |
| A  |     2 | 2021-01-08 11:00:00 |                                                                1 |
| A  |     3 | 2021-01-09 09:00:00 |                                                                0 |
| A  |     3 | 2021-01-10 19:00:00 |                                                               36 |
| A  |     3 | 2021-01-11 20:00:00 |                                                                1 |
| A  |     3 | 2021-01-12 22:00:00 |                                                                1 |
| B  |     1 | 2021-02-01 23:00:00 |                                                                0 |
| B  |     1 | 2021-02-02 15:00:00 |                                                               16 |
| B  |     1 | 2021-02-03 06:00:00 |                                                               15 |
| B  |     1 | 2021-02-04 08:00:00 |                                                               26 |
| B  |     2 | 2021-02-05 18:00:00 |                                                                0 |
| B  |     2 | 2021-02-05 19:00:00 |                                                                1 |
| B  |     2 | 2021-02-06 22:00:00 |                                                               27 |
| B  |     2 | 2021-02-06 23:00:00 |                                                                1 |
| B  |     2 | 2021-02-07 04:00:00 |                                                                5 |
| B  |     2 | 2021-02-08 02:00:00 |                                                               22 |
| B  |     3 | 2021-02-09 01:00:00 |                                                                0 |
| B  |     3 | 2021-02-10 03:00:00 |                                                               26 |
| B  |     3 | 2021-02-11 13:00:00 |                                                               11 |
| B  |     3 | 2021-02-12 14:00:00 |                                                                1 |
+----+-------+---------------------+------------------------------------------------------------------+

and lastly I want to calculate the average time which would be the sum of time differences per group (ID, SUBID) divide by the total count per group as under:

+----+-------+------------------------------------------------------------+
| ID | SUBID | Average  time (count per group/ total time diff of group ) |
+----+-------+------------------------------------------------------------+
| A  |     1 | 15/4                                                       |
| A  |     2 | 43/6                                                       |
| A  |     3 | 38/4                                                       |
| B  |     1 | 57/4                                                       |
| B  |     2 | 56/6                                                       |
| B  |     3 | 38/4                                                       |
+----+-------+------------------------------------------------------------+

I'm fairly new to R and I came across lubridate to help me format the dates and I wasable to get the time diff using the code below

df%>%
        group_by(ID, SUBID) %>%
        mutate(time_diff = difftime(date, lag(date), unit = 'min'))

However I'm having troubles getting difference of just the business days time and also getting the average time as per the last table


Solution

  • Welcome on SO! Using dplyr and lubridate:

    Data used:

    library(tidyverse)
    library(lubridate)
    df <- data.frame(ID = c("A","A","A","A"),
                     SUBID = c(1,1,2,2),
                     Date = lubridate::as_datetime(c("2021-01-01 12:00:00","2021-01-02 1:00:00","2021-01-01 2:00:00","2021-01-01 13:00:00")))
    
      ID SUBID                Date
    1  A     1 2021-01-01 12:00:00
    2  A     1 2021-01-02 01:00:00
    3  A     2 2021-01-01 02:00:00
    4  A     2 2021-01-01 13:00:00
    

    Code:

    df %>% 
      group_by(ID, SUBID) %>% 
      mutate(diff = Date - lag(Date)) %>% 
      mutate(diff = ifelse(is.na(diff), 0, diff)) %>% 
      summarise(Average = sum(diff)/n())
    

    Output:

      ID    SUBID Average
      <chr> <dbl>   <dbl>
    1 A         1     6.5
    2 A         2     5.5
    

    Edit: How to handle week_ends

    For the week-ends, the simplier solutions is to change the day to the next monday:

    df %>% 
      mutate(week_day = wday(Date,label = TRUE, abbr = FALSE)) %>%
      mutate(Date = ifelse(week_day == "samedi", Date + days(2),
                           ifelse(week_day == "dimanche", Date + days(1), Date))) %>%
      mutate(Date = as_datetime(Date))
    

    This create the column week_day with the name of the day. If the day is a "samedi" (saturday) or a "dimanche" (sunday), it adds 2 or 1 day to the Date so it becomes a Monday. Then, you just need to reorder the dates (df %>% arrange(ID, SUBID, Date)) and rerun the first code.

    As my local langage is french, you have to change the samedi and dimanche to saturday and sunday

    For the holidays, you can apply the same system by creating a time-interval variable which represents the holidays, test for each date if it is whithin this interval, and if so, change the date to the last day of this interval.