I have a booked trip dataset as follows:
bktrips <- data.frame(
userID =c("P001", "P001", "P001", "P001", "P001", "P002", "P002", "P002", "P002"),
mode = c("bus", "train", "taxi", "bus", "train", "taxi","bus", "train", "taxi"),
Origin = c("O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8", "O9"),
Destination = c("D1", "D2", "D3", "D4", "D5", "D6", "D7","D8", "D9" ),
depart_dt = c("2019-11-05 8:00:00","2019-11-05 8:30:00", "2019-11-05 11:00:00", "2019-11-05 11:40:00", "2019-11-06 8:00:00", "2019-11-06 9:10:00", "2019-11-07 8:00:00", "2019-11-08 8:00:00", "2019-11-08 8:50:00"),
Olat = c("-33.87085", "-33.87138", "-33.79504", "-33.87832", "-33.89158", "-33.88993", "-33.89173", "-33.88573", "-33.88505"),
Olon = c("151.2073", "151.2039", "151.2737", "151.2174","151.2485", "151.2805","151.2469", "151.2169","151.2156"),
Dlat = c("-33.87372", "-33.87384", "-33.88323", "-33.89165", "-33.88993", "-33.89177", "-33.88573", "-33.87731", "-33.88573"),
Dlon = c("151.1957", "151.2126", "151.2175", "151.2471","151.2471", "151.2805","151.2514", "151.2175","151.2169")
)
Now I need to restructure this booked trip dataset to prepare a linked trip dataset.
For example, link the bus and train trip (by the same userID, P001) into one linked trip, and redefine the origin and destination for this journey (O1 and D2, respectively).
We need to use rules to link trips (trip by same userID, destination of previous trip is in proximity of the origin of next trip (within 1 km), time gap between previous trip ending and next trip departing smaller than 60 minutes).
In the booked trip dataset, the variables are:
Can anyone help me here, please? I am a new user of R. Many many thanks in advance.
Here is an approach using dplyr
and geosphere
to calculate distances. I use lubridate
to fix your date column.
First, we fix the classes of your columns. Next, we rely on the fact that trips must occur in temporal order. Therefore, we calculate the distance from the preceding destination with lag
from dplyr
and distHaversine
from geosphere
and the time since the previous departure.
library(dplyr)
library(geosphere)
library(lubridate)
bktrips %>%
mutate(depart_dt = ymd_hms(depart_dt)) %>%
mutate_at(vars(contains(c("lat","lon"))),list(~as.numeric(as.character(.)))) %>%
group_by(userID) %>%
arrange(depart_dt,.by_group = TRUE) %>%
mutate(DistPrevDest = distHaversine(cbind(Olon,Olat),cbind(lag(Dlon),lag(Dlat))),
TimePrevDep = difftime(depart_dt,lag(depart_dt))) %>%
dplyr::select(-depart_dt,-contains(c("lat","lon")))
userID mode Origin Destination DistPrevDest TimePrevDep
<fct> <fct> <fct> <fct> <dbl> <drtn>
1 P001 bus O1 D1 NA NA mins
2 P001 train O2 D2 801. 30 mins
3 P001 taxi O3 D3 10434. 150 mins
4 P001 bus O4 D4 547. 40 mins
5 P001 train O5 D5 130. 1220 mins
6 P002 taxi O6 D6 NA NA mins
7 P002 bus O7 D7 3105. 1370 mins
8 P002 train O8 D8 3188. 1440 mins
9 P002 taxi O9 D9 879. 50 mins
Now we can add a TripID
using some logic and cumsum
.
Then we group by TripID
, and use summarize
to redefine all the columns.
bktrips %>%
mutate(depart_dt = ymd_hms(depart_dt)) %>%
bktrips %>%
mutate(depart_dt = ymd_hms(depart_dt)) %>%
mutate_at(vars(contains(c("lat","lon"))),list(~as.numeric(as.character(.)))) %>%
group_by(userID) %>%
arrange(depart_dt,.by_group = TRUE) %>%
mutate(DistPrevDest = distHaversine(cbind(Olon,Olat),cbind(lag(Dlon),lag(Dlat))),
TimePrevDep = difftime(depart_dt,lag(depart_dt))) %>%
mutate(TripID = cumsum(!((is.na(DistPrevDest) | DistPrevDest < 1000) & (is.na(TimePrevDep) |TimePrevDep < 60)))) %>%
group_by(userID,TripID) %>%
summarize(mode = paste(mode,collapse = ","),
Origin = first(Origin),
Destination = last(Destination),
depart_dt = paste(depart_dt,collapse = ","),
Olat = first(Olat),
Olon = first(Olon),
Dlat = last(Dlat),
Dlon = last(Dlon))
userID TripID mode Origin Destination depart_dt Olat Olon Dlat Dlon
<fct> <int> <chr> <fct> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 P001 0 bus,train O1 D2 2019-11-05 08:00:00,2019-11-05 08:30:00 -33.9 151. -33.9 151.
2 P001 1 taxi,bus O3 D4 2019-11-05 11:00:00,2019-11-05 11:40:00 -33.8 151. -33.9 151.
3 P001 2 train O5 D5 2019-11-06 08:00:00 -33.9 151. -33.9 151.
4 P002 0 taxi O6 D6 2019-11-06 09:10:00 -33.9 151. -33.9 151.
5 P002 1 bus O7 D7 2019-11-07 08:00:00 -33.9 151. -33.9 151.
6 P002 2 train,taxi O8 D9 2019-11-08 08:00:00,2019-11-08 08:50:00 -33.9 151. -33.9 151.
I suggest you also include arrival time in your data and instead calculate the difference between departure and the previous arrival.
Edit:
Missed a cumsum()
. Now fixed. Also, don't need rleid
anymore.