I am looking to calculate differences in time for different groups based on beginning work times and end work times. How can I tell R to calculate difftime between two rows based on their labels couched in a group? Below is a sample data set:
library(data.table)
latemail <- function(N, st="2012/01/01", et="2012/02/01") {
st <- as.POSIXct(as.Date(st))
et <- as.POSIXct(as.Date(et))
dt <- as.numeric(difftime(et,st,unit="sec"))
ev <- sort(runif(N, 0, dt))
rt <- st + ev
}
#create our data frame
set.seed(42)
dt = latemail(20)
work = setDT(as.data.frame(dt))
work[,worker:= stringi::stri_rand_strings(2, 5)]
work[,dt:= as.POSIXct(as.character(work$dt), tz = "GMT")]
work[,status:=NA]
#order
setorder(work, worker, dt)
#add work times
work$status[1] = "start"
work$status[5] = "end"
work$status[6] = "start"
work$status[10] = "end"
work$status[11] = "start"
work$status[15] = "end"
work$status[16] = "start"
work$status[20] = "end"
table looks like this now:
dt worker status
1: 2012-01-04 23:11:31 VOuRp start
2: 2012-01-09 15:53:16 VOuRp NA
3: 2012-01-15 02:56:45 VOuRp NA
4: 2012-01-16 21:12:26 VOuRp NA
5: 2012-01-20 16:27:31 VOuRp end
6: 2012-01-22 15:34:05 VOuRp start
7: 2012-01-23 15:01:18 VOuRp NA
8: 2012-01-29 03:36:56 VOuRp NA
9: 2012-01-29 20:11:02 VOuRp NA
10: 2012-01-31 02:48:01 VOuRp end
11: 2012-01-04 10:24:38 u8zw5 start
12: 2012-01-08 17:02:20 u8zw5 NA
13: 2012-01-14 23:33:35 u8zw5 NA
14: 2012-01-15 12:23:52 u8zw5 NA
15: 2012-01-18 03:53:15 u8zw5 end
16: 2012-01-21 03:48:08 u8zw5 start
17: 2012-01-23 02:01:10 u8zw5 NA
18: 2012-01-26 12:51:10 u8zw5 NA
19: 2012-01-29 18:23:46 u8zw5 NA
20: 2012-01-29 22:22:14 u8zw5 end
Answer I'm looking for:
ultimately I would like to get the bottom values (labeled worker 1 and worker 2 just because wasn't sure how to do the parallel of set.seed()
for stringi). The following code gives me the first row for worker 1, but I'd like each shift for each worker:
difftime(as.POSIXct("2012-01-20 16:27:31"), as.POSIXct("2012-01-04 23:11:31"), units = "hours")
Work time time difference in hours
worker 1 377.2667 hours
worker 2 . . . .
In this example I have an even set of values between workers, but assuming I have variable rows between different workers what would that look like? I'm assuming some sort of difftime formula? I would perfer a data table solution as I am working with large data.
Here is a solution using data.table
:
work[status %in% c("start", "end"),
time.diff := ifelse(status == "start",
difftime(shift(dt, fill = NA, type = "lead"), dt, units = "hours"), NA),
by = worker][status == "start", sum(time.diff), worker]
we get:
worker V1
1: VOuRp 580.4989
2: u8zw5 540.0453
>
where V1
has the sum of all hours from start-end interval for each worker.
Let's explain it step by step for better understanding.
STEP 1. Select all rows with start
or end
status:
work.se <- work[status %in% c("start", "end")]
dt worker status
1: 2012-01-04 23:11:31 VOuRp start
2: 2012-01-20 16:27:31 VOuRp end
3: 2012-01-22 15:34:05 VOuRp start
4: 2012-01-31 02:48:01 VOuRp end
5: 2012-01-04 10:24:38 u8zw5 start
6: 2012-01-18 03:53:15 u8zw5 end
7: 2012-01-21 03:48:08 u8zw5 start
8: 2012-01-29 22:22:14 u8zw5 end
>
STEP 2: Create a function for calculating the time differences between the current row and the next one. This function will be invoked inside the data.table
object. We use the shift
function from the same package:
getDiff <- function(x) {
difftime(shift(x, fill = NA, type = "lead"), x, units = "hours")
}
getDiff
computes the time difference from the next record (within the group) and the current one. It assigns NA
for the last row because there is no next value. Then we exclude the NA
values from the calculation.
STEP 3: Invoke it within the data.table
syntax:
work.result <- work.se[, time.diff := ifelse(status == "start",
getDiff(dt), NA), by = worker]
we get this:
dt worker status time.diff
1: 2012-01-04 23:11:31 VOuRp start 377.2667
2: 2012-01-20 16:27:31 VOuRp end NA
3: 2012-01-22 15:34:05 VOuRp start 203.2322
4: 2012-01-31 02:48:01 VOuRp end NA
5: 2012-01-04 10:24:38 u8zw5 start 329.4769
6: 2012-01-18 03:53:15 u8zw5 end NA
7: 2012-01-21 03:48:08 u8zw5 start 210.5683
8: 2012-01-29 22:22:14 u8zw5 end NA
STEP 4: Sum the non-NA
values for time.diff
column for each worker:
> work.result[status == "start", sum(time.diff), worker]
worker V1
1: VOuRp 580.4989
2: u8zw5 540.0453
>
data.table
object can be concatenated via []
appended, therefore it can be consolidated into one single sentence for the last part:
work.se[, time.diff := ifelse(status == "start",
getDiff(dt), NA), by = worker][status == "start", sum(time.diff), worker]
FINAL: Putting all together into one single sentence:
work[status %in% c("start", "end"),
time.diff := ifelse(status == "start",
difftime(shift(dt, fill = NA, type = "lead"), dt, units = "hours"), NA),
by = worker][status == "start", sum(time.diff), worker]
Check this link for data.table
basic syntax.
I hope this would help, please let us know if it is what you wanted