r, data.table, difftime

How to calculate difference in time between variable rows in R?


I want to calculate time differences for different groups, based on work start times and work end times. How can I tell R to compute difftime between two rows identified by their labels within a group? Below is a sample data set:

library(data.table)


# generate N sorted random timestamps between st and et
latemail <- function(N, st = "2012/01/01", et = "2012/02/01") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, units = "secs"))
  ev <- sort(runif(N, 0, dt))
  st + ev
}

#create our data frame
set.seed(42)
dt = latemail(20)
work = setDT(as.data.frame(dt))
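# two random worker IDs alternated over all rows; recent data.table versions
# do not recycle a length-2 value in :=, so rep() is used explicitly
work[, worker := rep(stringi::stri_rand_strings(2, 5), length.out = .N)]
work[, dt := as.POSIXct(as.character(work$dt), tz = "GMT")]
work[, status := NA_character_]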

#order
setorder(work, worker, dt)

#add work times
work$status[1] = "start"
work$status[5] = "end"
work$status[6] = "start"
work$status[10] = "end"
work$status[11] = "start"
work$status[15] = "end"
work$status[16] = "start"
work$status[20] = "end"

The table now looks like this:

                    dt worker status
 1: 2012-01-04 23:11:31  VOuRp  start
 2: 2012-01-09 15:53:16  VOuRp     NA
 3: 2012-01-15 02:56:45  VOuRp     NA
 4: 2012-01-16 21:12:26  VOuRp     NA
 5: 2012-01-20 16:27:31  VOuRp    end
 6: 2012-01-22 15:34:05  VOuRp  start
 7: 2012-01-23 15:01:18  VOuRp     NA
 8: 2012-01-29 03:36:56  VOuRp     NA
 9: 2012-01-29 20:11:02  VOuRp     NA
10: 2012-01-31 02:48:01  VOuRp    end
11: 2012-01-04 10:24:38  u8zw5  start
12: 2012-01-08 17:02:20  u8zw5     NA
13: 2012-01-14 23:33:35  u8zw5     NA
14: 2012-01-15 12:23:52  u8zw5     NA
15: 2012-01-18 03:53:15  u8zw5    end
16: 2012-01-21 03:48:08  u8zw5  start
17: 2012-01-23 02:01:10  u8zw5     NA
18: 2012-01-26 12:51:10  u8zw5     NA
19: 2012-01-29 18:23:46  u8zw5     NA
20: 2012-01-29 22:22:14  u8zw5    end

Answer I'm looking for: ultimately I would like to get the values below (labeled worker 1 and worker 2 only because I wasn't sure how to make the stringi output reproducible in the way set.seed() does). The following code gives me the first row for worker 1, but I'd like each shift for each worker:

difftime(as.POSIXct("2012-01-20 16:27:31"), as.POSIXct("2012-01-04 23:11:31"), units = "hours")
    Work time   time difference in hours  
    worker 1         377.2667 hours
    worker 2         . . . . 

In this example each worker has the same number of rows, but what would this look like if the number of rows varied between workers? I'm assuming some sort of difftime formula? I would prefer a data.table solution, as I am working with large data.


Solution

  • Here is a solution using data.table:

     work[status %in% c("start", "end"), 
            time.diff := ifelse(status == "start", 
            difftime(shift(dt, fill = NA, type = "lead"), dt, units = "hours"), NA), 
            by = worker][status == "start", sum(time.diff), worker]
    

    we get:

     worker       V1
    1:  VOuRp 580.4989
    2:  u8zw5 540.0453
    > 
    

    where V1 is the total number of hours across all start-to-end intervals for each worker.

    Here is a step-by-step explanation.

    STEP 1: Select all rows with start or end status:

    work.se <- work[status %in% c("start", "end")]
    
                        dt worker status
    1: 2012-01-04 23:11:31  VOuRp  start
    2: 2012-01-20 16:27:31  VOuRp    end
    3: 2012-01-22 15:34:05  VOuRp  start
    4: 2012-01-31 02:48:01  VOuRp    end
    5: 2012-01-04 10:24:38  u8zw5  start
    6: 2012-01-18 03:53:15  u8zw5    end
    7: 2012-01-21 03:48:08  u8zw5  start
    8: 2012-01-29 22:22:14  u8zw5    end
    > 
    

    STEP 2: Create a function that calculates the time difference between the current row and the next one. This function will be called inside the data.table expression. We use the shift function from the data.table package:

    getDiff <- function(x) {
        difftime(shift(x, fill = NA, type = "lead"), x, units = "hours")
    }
    

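    getDiff computes the time difference between the next record (within the group) and the current one. It returns NA for the last row because there is no next value. On an "end" row the difference is the gap until the next "start", which is not work time, so in the next step ifelse() keeps the value only on "start" rows and sets all other rows to NA.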
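
To make the behaviour of the lead shift concrete, here is a minimal sketch on four made-up timestamps (start, end, start, end). The vector x and the name gaps are illustrative only and are not part of the data set above.

```r
library(data.table)

# Toy timestamps (made up for illustration): start, end, start, end
x <- as.POSIXct(c("2012-01-01 08:00:00", "2012-01-01 16:00:00",
                  "2012-01-02 09:00:00", "2012-01-02 12:30:00"), tz = "GMT")

# Same computation as the body of getDiff(): next timestamp minus current one
gaps <- as.numeric(difftime(shift(x, fill = NA, type = "lead"), x, units = "hours"))
gaps
# 8.0  (first shift)
# 17.0 (time off between the first "end" and the second "start")
# 3.5  (second shift)
# NA   (last row has no next value)
```

Only the first and third values are work time, which is why the differences on "end" rows have to be discarded.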

    STEP 3: Invoke it within the data.table syntax:

    work.result <- work.se[, time.diff := ifelse(status == "start", 
        getDiff(dt), NA), by = worker]
    

    we get this:

                        dt worker status time.diff
    1: 2012-01-04 23:11:31  VOuRp  start  377.2667
    2: 2012-01-20 16:27:31  VOuRp    end        NA
    3: 2012-01-22 15:34:05  VOuRp  start  203.2322
    4: 2012-01-31 02:48:01  VOuRp    end        NA
    5: 2012-01-04 10:24:38  u8zw5  start  329.4769
    6: 2012-01-18 03:53:15  u8zw5    end        NA
    7: 2012-01-21 03:48:08  u8zw5  start  210.5683
    8: 2012-01-29 22:22:14  u8zw5    end        NA
    

    STEP 4: Sum the non-NA values for time.diff column for each worker:

    > work.result[status == "start", sum(time.diff), worker]
       worker       V1
    1:  VOuRp 580.4989
    2:  u8zw5 540.0453
    > 
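
The sum above gives one total per worker. If one row per shift is wanted instead, as the question asks, a possible sketch is to number the shifts with a running count of "start" rows. This assumes the rows are sorted by time within each worker and that every "start" has a matching "end"; the toy table and the names shift.id and per.shift are made up for illustration.

```r
library(data.table)

# Toy data (made up): worker "A" has two shifts, worker "B" has one,
# with a varying number of unlabeled rows in between
toy <- data.table(
  worker = c("A", "A", "A", "A", "A", "B", "B", "B"),
  dt = as.POSIXct(c("2012-01-01 08:00:00", "2012-01-01 12:00:00",
                    "2012-01-01 16:00:00", "2012-01-02 09:00:00",
                    "2012-01-02 12:30:00", "2012-01-01 10:00:00",
                    "2012-01-01 11:00:00", "2012-01-01 20:00:00"), tz = "GMT"),
  status = c("start", NA, "end", "start", "end", "start", NA, "end")
)
setorder(toy, worker, dt)

# Keep only the labeled rows, then number the shifts per worker:
# every "start" row opens a new shift
se <- toy[status %in% c("start", "end")]
se[, shift.id := cumsum(status == "start"), by = worker]

# One row per worker and shift: end time minus start time, in hours
per.shift <- se[, .(hours = as.numeric(difftime(dt[status == "end"],
                                                dt[status == "start"],
                                                units = "hours"))),
                by = .(worker, shift.id)]
per.shift
```

Because the grouping is by worker and shift number, it does not matter how many rows each worker has or how many unlabeled rows sit between a "start" and its "end".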
    

    data.table calls can be chained by appending another [], so the last two steps can be combined into a single statement:

    work.se[, time.diff := ifelse(status == "start", 
        getDiff(dt), NA), by = worker][status == "start", sum(time.diff), worker]
    

    FINAL: Putting it all together into a single statement:

    work[status %in% c("start", "end"), 
        time.diff := ifelse(status == "start", 
        difftime(shift(dt, fill = NA, type = "lead"), dt, units = "hours"), NA), 
        by = worker][status == "start", sum(time.diff), worker]
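
The lead-based difference only measures a shift if each "start" is immediately followed by its own "end" within the worker. With real data it may be worth checking that first. The following is a sketch under that assumption; the helper alternates() and the toy table are hypothetical and not part of data.table.

```r
library(data.table)

# Toy data (made up), already in time order: worker "B" is missing an "end" label
toy <- data.table(
  worker = c("A", "A", "B", "B", "B"),
  status = c("start", "end", "start", "start", "end")
)

# Hypothetical helper: TRUE when the labels read start, end, start, end, ...
alternates <- function(s) {
  length(s) %% 2 == 0 &&
    all(s == rep(c("start", "end"), length.out = length(s)))
}

# One row per worker, flagging whether the labels pair up cleanly
check <- toy[status %in% c("start", "end"), .(ok = alternates(status)), by = worker]
check
```

Workers flagged with ok equal to FALSE would need their labels fixed before the time differences can be trusted.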
    

    See the data.table documentation for the basic syntax. I hope this helps; please let us know if it is what you wanted.