I'm looking through some public transport timetable data and trying to find the time that each vehicle was at the previous stop. There's no vehicle_number in the data, so I just need to find the closest time preceding the current time in the dataset for the previous stop.
library(tidyverse)
data <- tribble(~stop,~prev_stop,~time,
5,4,10,
6,5,10.1,
7,6,10.2,
9,7,10.3,
5,4,11,
6,5,11.1,
7,6,11.2,
9,7,11.3,
5,4,12,
6,5,12.1,
7,6,12.2,
9,7,12.3)
Something like...
data %>%
mutate(time_at_prev_stop = max(time[stop{in another row) == prev_stop{in current row}] & time {in target row}<time{in current row}))
Any thoughts? Thanks so much!
You can join the data on itself, matching prev_stop with stop, filter the rows so that time
is in the correct direction (i.e. previous stops times must be less than current stop times), and then using a helper id
column, group on the original data rows and selecting the max.
Here is a dplyr
implementation, but I would recommend data.table
, because its support of non-equi joins and rolling joins
library(dplyr)
left_join(data %>% mutate(id = row_number()),data,by=c("prev_stop" = "stop")) %>%
filter(time.x>time.y | is.na(time.y)) %>%
arrange(id,desc(time.y)) %>%
group_by(id) %>%
slice_head(n=1) %>%
ungroup() %>%
select(stop, prev_stop, time=time.x, time_at_previous_stop = time.y)
Output:
# A tibble: 12 x 4
stop prev_stop time time_at_previous_stop
<dbl> <dbl> <dbl> <dbl>
1 5 4 10 NA
2 6 5 10.1 10
3 7 6 10.2 10.1
4 9 7 10.3 10.2
5 5 4 11 NA
6 6 5 11.1 11
7 7 6 11.2 11.1
8 9 7 11.3 11.2
9 5 4 12 NA
10 6 5 12.1 12
11 7 6 12.2 12.1
12 9 7 12.3 12.2
A data.table implementation using non-equi join:
library(data.table)
setDT(data)
data[,`:=`(id=.I,prev_time=time)][data, on=.(stop=prev_stop, prev_time<time)][
,.SD[.N,.(stop=i.stop, time = prev_time, prev_stop=stop, time_at_previous_stop = time)], i.id, keep=F
]
A more readable data.table
implementation using rolling join:
d1 = data
d2 = data
setkey(setDT(d1),stop,time)
setkey(setDT(d2),prev_stop, time)
d1[, t:=time][d2,roll=+Inf][,.(stop=i.stop, prev_stop=stop,time, time_at_previous_stop=t)]
Output:
stop prev_stop time time_at_previous_stop
1: 5 4 10.0 NA
2: 5 4 11.0 NA
3: 5 4 12.0 NA
4: 6 5 10.1 10.0
5: 6 5 11.1 11.0
6: 6 5 12.1 12.0
7: 7 6 10.2 10.1
8: 7 6 11.2 11.1
9: 7 6 12.2 12.1
10: 9 7 10.3 10.2
11: 9 7 11.3 11.2
12: 9 7 12.3 12.2