find the max value that fits a criterion from another column

I'm looking through some public transport timetable data and trying to find the time that each vehicle was at the previous stop. There's no vehicle_number in the data, so I just need to find the closest time preceding the current time in the dataset for the previous stop.

library(tidyverse)

data <- tribble(~stop,~prev_stop,~time,
5,4,10,
6,5,10.1,
7,6,10.2,
9,7,10.3,
5,4,11,
6,5,11.1,
7,6,11.2,
9,7,11.3,
5,4,12,
6,5,12.1,
7,6,12.2,
9,7,12.3)

Something like...

data %>% 
mutate(time_at_prev_stop = max(time[stop{in another row) == prev_stop{in current row}] & time {in target row}<time{in current row}))

Any thoughts? Thanks so much!

Solution

You can join the data on itself, matching prev_stop with stop, filter the rows so that time is in the correct direction (i.e. previous stops times must be less than current stop times), and then using a helper id column, group on the original data rows and selecting the max.

Here is a dplyr implementation, but I would recommend data.table, because its support of non-equi joins and rolling joins

library(dplyr)

left_join(data %>% mutate(id = row_number()),data,by=c("prev_stop" = "stop")) %>%
  filter(time.x>time.y | is.na(time.y)) %>% 
  arrange(id,desc(time.y)) %>% 
  group_by(id) %>% 
  slice_head(n=1) %>% 
  ungroup() %>% 
  select(stop, prev_stop, time=time.x, time_at_previous_stop = time.y)

Output:

# A tibble: 12 x 4
    stop prev_stop  time time_at_previous_stop
   <dbl>     <dbl> <dbl>                 <dbl>
 1     5         4  10                    NA  
 2     6         5  10.1                  10  
 3     7         6  10.2                  10.1
 4     9         7  10.3                  10.2
 5     5         4  11                    NA  
 6     6         5  11.1                  11  
 7     7         6  11.2                  11.1
 8     9         7  11.3                  11.2
 9     5         4  12                    NA  
10     6         5  12.1                  12  
11     7         6  12.2                  12.1
12     9         7  12.3                  12.2

A data.table implementation using non-equi join:

library(data.table)  

setDT(data)
data[,`:=`(id=.I,prev_time=time)][data, on=.(stop=prev_stop, prev_time<time)][
  ,.SD[.N,.(stop=i.stop, time = prev_time, prev_stop=stop, time_at_previous_stop = time)], i.id, keep=F
]

A more readable data.table implementation using rolling join:

d1 = data
d2 = data
setkey(setDT(d1),stop,time)
setkey(setDT(d2),prev_stop, time)
d1[, t:=time][d2,roll=+Inf][,.(stop=i.stop, prev_stop=stop,time, time_at_previous_stop=t)]

Output:

    stop prev_stop time time_at_previous_stop
 1:    5         4 10.0                    NA
 2:    5         4 11.0                    NA
 3:    5         4 12.0                    NA
 4:    6         5 10.1                  10.0
 5:    6         5 11.1                  11.0
 6:    6         5 12.1                  12.0
 7:    7         6 10.2                  10.1
 8:    7         6 11.2                  11.1
 9:    7         6 12.2                  12.1
10:    9         7 10.3                  10.2
11:    9         7 11.3                  11.2
12:    9         7 12.3                  12.2