Tags: r, dplyr, data.table, lubridate

Summarizing across overlapping dates


I am trying to see how I can create a variable which summarizes observations across multiple dates.

library(data.table)
library(lubridate)
library(magrittr)

sample <- data.table(start = c("2018-12-22 23:00:00",
                               "2018-12-23 06:00:00",
                               "2018-12-22 06:00:00",
                               "2018-12-23 06:00:00"),
                     end = c("2018-12-23 06:00:00",
                             "2018-12-23 13:00:00",
                             "2018-12-23 12:00:00",
                             "2018-12-24 01:00:00"),
                     store = c("A", "A", "B", "B"))

sample[, start:= ymd_hms(start)]
sample[, end := ymd_hms(end)]

sample 

> sample
                 start                 end store
1: 2018-12-22 23:00:00 2018-12-23 06:00:00     A
2: 2018-12-23 06:00:00 2018-12-23 13:00:00     A
3: 2018-12-22 06:00:00 2018-12-23 12:00:00     B
4: 2018-12-23 06:00:00 2018-12-24 01:00:00     B

Here, sample is a time card of "shifts" worked at each store. Store A has two observations, each with a start and an end time. If no shift "bled" across dates (as the first observation does, beginning on 2018-12-22 and ending on 2018-12-23), I would simply subtract the start time from the end time and sum by store and date to get the total number of minutes worked at each store. Something like:

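worked_mins <- sample %>%
  # calendar date on which each shift starts
  .[, date := as.Date(start)] %>%
  # ask for minutes explicitly; a bare end - start picks its units automatically
  .[, minutes := as.numeric(difftime(end, start, units = "mins"))] %>%
  .[, .(worked_mins = sum(minutes)), by = .(store, date)]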

However, I am trying to work out how best to sum the minutes when a shift runs across multiple dates (potentially two or more days).

From the above, the desired output would be (the values below are in hours):

worked_mins = data.table(store = c("A","A", "B", "B", "B"),
                         date = c("2018-12-22", "2018-12-23",
                                  "2018-12-22", "2018-12-23",
                                  "2018-12-24"),
                         worked_mins = c(1, 13, 18, 30, 1))

> worked_mins
   store       date worked_mins
1:     A 2018-12-22           1
2:     A 2018-12-23          13
3:     B 2018-12-22          18
4:     B 2018-12-23          30
5:     B 2018-12-24           1

Thanks!


Solution

  • An updated solution that measures actual elapsed time rather than counting whole hours, so it should also handle fractional hours.

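    library(lubridate) # ceiling_date, floor_date
    # Split one shift (st to en) at every midnight it crosses and return
    # one row per calendar date with the time worked on that date.
    func <- function(st, en, units = "hours") {
      # midnights that follow each day-step from st
      midns <- ceiling_date(seq(st, en, by = "day"), unit = "day")
      # cut points: st, en, and any midnight strictly between them
      times <- unique(sort(c(midns[ st < midns & midns < en], st, en)))
      if (length(times) < 2) {
        # zero-length shift (st == en): report 0 on the start date
        data.table(date = as.Date(floor_date(st, unit = "days")), d = structure(0, class = "difftime", units = units))
      } else {
        data.table(date = as.Date(floor_date(times[-length(times)], unit = "days")), d = `units<-`(diff(times), units))
      }
    }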
    
    sample[, rbindlist(Map(func, start, end)), by = .(store)
      ][, .(d = sum(d)), by = .(store, date)]
    #     store       date          d
    #    <char>     <Date> <difftime>
    # 1:      A 2018-12-22    1 hours
    # 2:      A 2018-12-23   13 hours
    # 3:      B 2018-12-22   18 hours
    # 4:      B 2018-12-23   30 hours
    # 5:      B 2018-12-24    1 hours
    

    (The d column is still numeric; values such as 1 hours simply carry a label for their units, which can be removed easily by wrapping the diff in as.numeric.)
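
    As a small sketch of that conversion: as.numeric on a difftime also accepts a units argument, so the totals can be turned into plain minutes (what the question asked for) in one step. The vector d below is typed in by hand from the output above rather than computed, purely so the snippet runs on its own, and the name worked_mins is borrowed from the question:

    ```r
    # Per-day totals copied by hand from the output above (difftime, in hours).
    d <- as.difftime(c(1, 13, 18, 30, 1), units = "hours")

    # Convert to minutes and drop the units label in a single call.
    worked_mins <- as.numeric(d, units = "mins")
    worked_mins
    ```

    Inside the data.table call, the same idea could presumably be applied to the summed column, e.g. as.numeric(sum(d), units = "mins").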

    func works by inserting the midnights that fall between st and en; sorting these unique timestamps into the vector times lets us diff across them, and applying floor_date to them tells us the date on which each interval started.

    You can see what func is doing with this quick demo, which sets the first row's end equal to its start so that it becomes a 0-second shift (for testing and validation):

    copy(sample)[1, end:=start][, rbindlist(Map(func, start, end)), by = .(store)]
    #     store       date          d
    #    <char>     <Date> <difftime>
    # 1:      A 2018-12-22    0 hours
    # 2:      A 2018-12-23    7 hours
    # 3:      B 2018-12-22   18 hours
    # 4:      B 2018-12-23   12 hours
    # 5:      B 2018-12-23   18 hours
    # 6:      B 2018-12-24    1 hours
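
    To make the steps inside func easier to follow, here is a sketch that runs the same calls by hand on a single shift, store B's first row from the question. The variable names hours_per_day and dates are only illustrative and do not appear in func, and a plain data.frame is used just to display the result:

    ```r
    library(lubridate)  # ymd_hms, ceiling_date, floor_date

    # Store B's first shift from the question: it crosses one midnight.
    st <- ymd_hms("2018-12-22 06:00:00")
    en <- ymd_hms("2018-12-23 12:00:00")

    # Midnights that follow each day-step of the shift, as in the first line of func.
    midns <- ceiling_date(seq(st, en, by = "day"), unit = "day")

    # Keep only the midnights strictly inside the shift, then add both endpoints.
    times <- unique(sort(c(midns[st < midns & midns < en], st, en)))

    # Each gap between consecutive cut points is time worked on a single date.
    hours_per_day <- `units<-`(diff(times), "hours")
    dates <- as.Date(floor_date(times[-length(times)], unit = "days"))

    data.frame(date = dates, d = hours_per_day)
    ```

    This shift crosses one midnight, so it is split into 18 hours on 2018-12-22 and 12 hours on 2018-12-23, matching rows 3 and 4 of the demo output above.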