Search code examples
rdataframetidyverse

group by two values and select only rows with earliest and latest value


My dataframe sample:

label timestamp           trip_num trip_direction 
1001  2023-10-05 03:43:18 8        0                 
1001  2023-10-05 03:44:42 8        0              
1001  2023-10-05 03:46:29 8        1              
1001  2023-10-05 03:47:54 8        1              
1001  2023-10-05 03:56:29 8        1              
1001  2023-10-05 03:57:24 8        0              
1001  2023-10-05 03:58:55 8        0              
1001  2023-10-05 04:12:33 8        0          
1001  2023-10-05 04:13:45 8        1            
1001  2023-10-05 04:53:08 8        1         
1003  2023-10-05 05:53:18 10       0            
1003  2023-10-05 05:54:42 10       0             
1003  2023-10-05 05:56:29 10       0             
1003  2023-10-05 05:57:54 10       1           
1003  2023-10-05 06:06:29 10       1           
1003  2023-10-05 06:07:24 10       0          
1003  2023-10-05 06:08:55 10       0          
1003  2023-10-05 06:22:33 10       0          
1003  2023-10-05 06:23:45 10       1          
1003  2023-10-05 06:54:08 10       1      

the code to create it:

library(data.table)
df <- data.table(label = c(1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001,
                           1003, 1003, 1003, 1003, 1003, 1003, 1003, 1003, 1003, 1003),
                 timestamp = c('2023-10-05 03:43:18', '2023-10-05 03:44:42', '2023-10-05 03:46:29', '2023-10-05 03:47:54', '2023-10-05 03:56:29',
                               '2023-10-05 03:57:24', '2023-10-05 03:58:55', '2023-10-05 04:12:33', '2023-10-05 04:13:45', '2023-10-05 04:53:08',
                               '2023-10-05 05:53:18', '2023-10-05 05:54:42', '2023-10-05 05:56:29', '2023-10-05 05:57:54', '2023-10-05 06:06:29',
                               '2023-10-05 06:07:24', '2023-10-05 06:08:55', '2023-10-05 06:22:33', '2023-10-05 06:23:45', '2023-10-05 06:54:08'),
                 trip_num = c(8,8,8,8,8,8,8,8,8,8,10,10,10,10,10,10,10,10,10,10),
                 trip_direction = c(0,0,1,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1))

I want to group by label, trip_num: take the earliest timestamp value for each pair of label and trip_num where trip_direction = 0, then the newest value for the same pair but trip_direction = 1 but before 1 changes to 0 in the column, and calculate the timestamp difference between those rows so that I get duration. And keep only trip_direction = 0 values. The expected output below:

label timestamp           trip_num trip_direction delta_t 
1001  2023-10-05 03:43:18 8        0              00:13:11   
1001  2023-10-05 03:57:24 8        0              00:55:44
1003  2023-10-05 05:53:18 10       0              00:13:11   
1003  2023-10-05 06:07:24 10       0              00:46:44

Solution

  • If you are open to a tidyverse-based approach, you could try:

    library(tidyverse)
    library(hms)
    
    df %>% 
      group_by(label, trip_num, grp = cumsum(trip_direction != lag(trip_direction, default = 0)) %/% 2) %>% 
      slice(c(1, n())) %>% 
      mutate(delta_t = as_hms(lead(timestamp) - timestamp)) %>% 
      drop_na(delta_t) %>% 
      ungroup() %>% 
      select(-grp) %>% 
      as.data.frame()
    

    This returns

      label           timestamp trip_num trip_direction  delta_t
    1  1001 2023-10-05 03:43:18        8              0 00:13:11
    2  1001 2023-10-05 03:57:24        8              0 00:55:44
    3  1003 2023-10-05 05:53:18       10              0 00:13:11
    4  1003 2023-10-05 06:07:24       10              0 00:46:44
    

    I added as.data.frame() to show delta_t in your desired format. The tibble()-version (just remove the as.data.frame()-part) looks like this:

    # A tibble: 4 × 5
      label timestamp           trip_num trip_direction delta_t
      <dbl> <dttm>                 <dbl>          <dbl> <time> 
    1  1001 2023-10-05 03:43:18        8              0 13'11" 
    2  1001 2023-10-05 03:57:24        8              0 55'44" 
    3  1003 2023-10-05 05:53:18       10              0 13'11" 
    4  1003 2023-10-05 06:07:24       10              0 46'44" 
    

    So, what happend here?

    • First we created an auxiliary grouping column based on the changing of trip_direction. This was necessary to meet your before 1 changes to 0 condition.
    • Next we took the first and the last row of these grouped data using slice.
    • The final steps are simply calculating the difference, dropping rows containing NAs, removing the auxiliary column and the grouping.