Search code examples
rdata.tabletidyverse

find the max value that fits a criterion from another column


I'm looking through some public transport timetable data and trying to find the time that each vehicle was at the previous stop. There's no vehicle_number in the data, so I just need to find the closest time preceding the current time in the dataset for the previous stop.

library(tidyverse)

data <- tribble(~stop,~prev_stop,~time,
5,4,10,
6,5,10.1,
7,6,10.2,
9,7,10.3,
5,4,11,
6,5,11.1,
7,6,11.2,
9,7,11.3,
5,4,12,
6,5,12.1,
7,6,12.2,
9,7,12.3)

Something like...

data %>% 
mutate(time_at_prev_stop = max(time[stop{in another row) == prev_stop{in current row}] & time {in target row}<time{in current row}))

Any thoughts? Thanks so much!


Solution

  • You can join the data on itself, matching prev_stop with stop, filter the rows so that time is in the correct direction (i.e. previous stops times must be less than current stop times), and then using a helper id column, group on the original data rows and selecting the max.

    Here is a dplyr implementation, but I would recommend data.table, because its support of non-equi joins and rolling joins

    library(dplyr)
    
    left_join(data %>% mutate(id = row_number()),data,by=c("prev_stop" = "stop")) %>%
      filter(time.x>time.y | is.na(time.y)) %>% 
      arrange(id,desc(time.y)) %>% 
      group_by(id) %>% 
      slice_head(n=1) %>% 
      ungroup() %>% 
      select(stop, prev_stop, time=time.x, time_at_previous_stop = time.y) 
    

    Output:

    # A tibble: 12 x 4
        stop prev_stop  time time_at_previous_stop
       <dbl>     <dbl> <dbl>                 <dbl>
     1     5         4  10                    NA  
     2     6         5  10.1                  10  
     3     7         6  10.2                  10.1
     4     9         7  10.3                  10.2
     5     5         4  11                    NA  
     6     6         5  11.1                  11  
     7     7         6  11.2                  11.1
     8     9         7  11.3                  11.2
     9     5         4  12                    NA  
    10     6         5  12.1                  12  
    11     7         6  12.2                  12.1
    12     9         7  12.3                  12.2
    

    A data.table implementation using non-equi join:

    library(data.table)  
    
    setDT(data)
    data[,`:=`(id=.I,prev_time=time)][data, on=.(stop=prev_stop, prev_time<time)][
      ,.SD[.N,.(stop=i.stop, time = prev_time, prev_stop=stop, time_at_previous_stop = time)], i.id, keep=F
    ]
    

    A more readable data.table implementation using rolling join:

    d1 = data
    d2 = data
    setkey(setDT(d1),stop,time)
    setkey(setDT(d2),prev_stop, time)
    d1[, t:=time][d2,roll=+Inf][,.(stop=i.stop, prev_stop=stop,time, time_at_previous_stop=t)]
    

    Output:

        stop prev_stop time time_at_previous_stop
     1:    5         4 10.0                    NA
     2:    5         4 11.0                    NA
     3:    5         4 12.0                    NA
     4:    6         5 10.1                  10.0
     5:    6         5 11.1                  11.0
     6:    6         5 12.1                  12.0
     7:    7         6 10.2                  10.1
     8:    7         6 11.2                  11.1
     9:    7         6 12.2                  12.1
    10:    9         7 10.3                  10.2
    11:    9         7 11.3                  11.2
    12:    9         7 12.3                  12.2