My dataframe sample:
label timestamp trip_num trip_direction
1001 2023-10-05 03:43:18 8 0
1001 2023-10-05 03:44:42 8 0
1001 2023-10-05 03:46:29 8 1
1001 2023-10-05 03:47:54 8 1
1001 2023-10-05 03:56:29 8 1
1001 2023-10-05 03:57:24 8 0
1001 2023-10-05 03:58:55 8 0
1001 2023-10-05 04:12:33 8 0
1001 2023-10-05 04:13:45 8 1
1001 2023-10-05 04:53:08 8 1
1003 2023-10-05 05:53:18 10 0
1003 2023-10-05 05:54:42 10 0
1003 2023-10-05 05:56:29 10 0
1003 2023-10-05 05:57:54 10 1
1003 2023-10-05 06:06:29 10 1
1003 2023-10-05 06:07:24 10 0
1003 2023-10-05 06:08:55 10 0
1003 2023-10-05 06:22:33 10 0
1003 2023-10-05 06:23:45 10 1
1003 2023-10-05 06:54:08 10 1
the code to create it:
library(data.table)
df <- data.table(label = c(1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001,
1003, 1003, 1003, 1003, 1003, 1003, 1003, 1003, 1003, 1003),
timestamp = c('2023-10-05 03:43:18', '2023-10-05 03:44:42', '2023-10-05 03:46:29', '2023-10-05 03:47:54', '2023-10-05 03:56:29',
'2023-10-05 03:57:24', '2023-10-05 03:58:55', '2023-10-05 04:12:33', '2023-10-05 04:13:45', '2023-10-05 04:53:08',
'2023-10-05 05:53:18', '2023-10-05 05:54:42', '2023-10-05 05:56:29', '2023-10-05 05:57:54', '2023-10-05 06:06:29',
'2023-10-05 06:07:24', '2023-10-05 06:08:55', '2023-10-05 06:22:33', '2023-10-05 06:23:45', '2023-10-05 06:54:08'),
trip_num = c(8,8,8,8,8,8,8,8,8,8,10,10,10,10,10,10,10,10,10,10),
trip_direction = c(0,0,1,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1))
I want to group by label
, trip_num
: take the earliest timestamp
value for each pair of label
and trip_num
where trip_direction = 0
, then the newest value for the same pair but trip_direction = 1
but before 1 changes to 0 in the column, and calculate the timestamp
difference between those rows so that I get duration. And keep only trip_direction = 0
values. The expected output below:
label timestamp trip_num trip_direction delta_t
1001 2023-10-05 03:43:18 8 0 00:13:11
1001 2023-10-05 03:57:24 8 0 00:55:44
1003 2023-10-05 05:53:18 10 0 00:13:11
1003 2023-10-05 06:07:24 10 0 00:46:44
If you are open to a tidyverse-based approach, you could try:
library(tidyverse)
library(hms)
df %>%
group_by(label, trip_num, grp = cumsum(trip_direction != lag(trip_direction, default = 0)) %/% 2) %>%
slice(c(1, n())) %>%
mutate(delta_t = as_hms(lead(timestamp) - timestamp)) %>%
drop_na(delta_t) %>%
ungroup() %>%
select(-grp) %>%
as.data.frame()
This returns
label timestamp trip_num trip_direction delta_t
1 1001 2023-10-05 03:43:18 8 0 00:13:11
2 1001 2023-10-05 03:57:24 8 0 00:55:44
3 1003 2023-10-05 05:53:18 10 0 00:13:11
4 1003 2023-10-05 06:07:24 10 0 00:46:44
I added as.data.frame()
to show delta_t
in your desired format. The tibble()
-version (just remove the as.data.frame()
-part) looks like this:
# A tibble: 4 × 5
label timestamp trip_num trip_direction delta_t
<dbl> <dttm> <dbl> <dbl> <time>
1 1001 2023-10-05 03:43:18 8 0 13'11"
2 1001 2023-10-05 03:57:24 8 0 55'44"
3 1003 2023-10-05 05:53:18 10 0 13'11"
4 1003 2023-10-05 06:07:24 10 0 46'44"
So, what happend here?
slice
.NA
s, removing the auxiliary column and the grouping.