Search code examples
rtidyversetidyrr-sfspread

How do reconstruct the given booked trip dataset to a desired linked trip dataset using probably spread() and gather() functions in R?


I have a booked trip dataset as follows:

bktrips <- data.frame(
  userID =c("P001", "P001", "P001", "P001", "P001", "P002", "P002", "P002", "P002"), 
  mode = c("bus", "train", "taxi", "bus", "train", "taxi","bus", "train", "taxi"), 
  Origin = c("O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8", "O9"), 
  Destination = c("D1", "D2", "D3", "D4", "D5", "D6", "D7","D8", "D9" ), 
  depart_dt = c("2019-11-05 8:00:00","2019-11-05 8:30:00", "2019-11-05 11:00:00", "2019-11-05 11:40:00", "2019-11-06 8:00:00", "2019-11-06 9:10:00", "2019-11-07 8:00:00", "2019-11-08 8:00:00", "2019-11-08 8:50:00"), 
  Olat = c("-33.87085", "-33.87138", "-33.79504", "-33.87832", "-33.89158", "-33.88993", "-33.89173", "-33.88573", "-33.88505"), 
  Olon = c("151.2073", "151.2039", "151.2737", "151.2174","151.2485", "151.2805","151.2469", "151.2169","151.2156"), 
  Dlat = c("-33.87372", "-33.87384", "-33.88323", "-33.89165", "-33.88993", "-33.89177", "-33.88573", "-33.87731", "-33.88573"), 
  Dlon = c("151.1957", "151.2126", "151.2175", "151.2471","151.2471", "151.2805","151.2514", "151.2175","151.2169")
)

Now I need to restructure this booked trip dataset to prepare a linked trip dataset.

For example, link the bus and train trip (by the same userID, P001) into one linked trip, and redefine the origin and destination for this journey (O1 and D2, respectively).

We need to use rules to link trips (trip by same userID, destination of previous trip is in proximity of the origin of next trip (within 1 km), time gap between previous trip ending and next trip departing smaller than 60 minutes).

In the booked trip dataset, the variables are:

  • Olat = Origin latitude
  • Olon = Origin longitude
  • Dlat = Destination latitude
  • Dlon = Destination longitude

Can anyone help me here, please? I am a new user of R. Many many thanks in advance.


Solution

  • Here is an approach using dplyr and geosphere to calculate distances. I use lubridate to fix your date column.

    First, we fix the classes of your columns. Next, we rely on the fact that trips must occur in temporal order. Therefore, we calculate the distance from the preceding destination with lag from dplyr and distHaversine from geosphere and the time since the previous departure.

    library(dplyr)
    library(geosphere)
    library(lubridate)
    bktrips %>%
      mutate(depart_dt = ymd_hms(depart_dt)) %>%
      mutate_at(vars(contains(c("lat","lon"))),list(~as.numeric(as.character(.)))) %>%
      group_by(userID) %>% 
      arrange(depart_dt,.by_group = TRUE) %>%
      mutate(DistPrevDest = distHaversine(cbind(Olon,Olat),cbind(lag(Dlon),lag(Dlat))),
             TimePrevDep = difftime(depart_dt,lag(depart_dt))) %>%
      dplyr::select(-depart_dt,-contains(c("lat","lon")))
      userID mode  Origin Destination DistPrevDest TimePrevDep
      <fct>  <fct> <fct>  <fct>              <dbl> <drtn>     
    1 P001   bus   O1     D1                   NA    NA mins  
    2 P001   train O2     D2                  801.   30 mins  
    3 P001   taxi  O3     D3                10434.  150 mins  
    4 P001   bus   O4     D4                  547.   40 mins  
    5 P001   train O5     D5                  130. 1220 mins  
    6 P002   taxi  O6     D6                   NA    NA mins  
    7 P002   bus   O7     D7                 3105. 1370 mins  
    8 P002   train O8     D8                 3188. 1440 mins  
    9 P002   taxi  O9     D9                  879.   50 mins  
    

    Now we can add a TripID using some logic and cumsum.

    Then we group by TripID, and use summarize to redefine all the columns.

    bktrips %>%
      mutate(depart_dt = ymd_hms(depart_dt)) %>%
      bktrips %>%
      mutate(depart_dt = ymd_hms(depart_dt)) %>%
      mutate_at(vars(contains(c("lat","lon"))),list(~as.numeric(as.character(.)))) %>%
      group_by(userID) %>% 
      arrange(depart_dt,.by_group = TRUE) %>%
      mutate(DistPrevDest = distHaversine(cbind(Olon,Olat),cbind(lag(Dlon),lag(Dlat))),
             TimePrevDep = difftime(depart_dt,lag(depart_dt))) %>%
      mutate(TripID = cumsum(!((is.na(DistPrevDest) | DistPrevDest < 1000) & (is.na(TimePrevDep) |TimePrevDep < 60)))) %>%
      group_by(userID,TripID) %>%
      summarize(mode = paste(mode,collapse = ","),
                Origin = first(Origin),
                Destination = last(Destination),
                depart_dt = paste(depart_dt,collapse = ","),
                Olat = first(Olat),
                Olon = first(Olon),
                Dlat = last(Dlat),
                Dlon = last(Dlon))
      userID TripID mode       Origin Destination depart_dt                                Olat  Olon  Dlat  Dlon
      <fct>   <int> <chr>      <fct>  <fct>       <chr>                                   <dbl> <dbl> <dbl> <dbl>
    1 P001        0 bus,train  O1     D2          2019-11-05 08:00:00,2019-11-05 08:30:00 -33.9  151. -33.9  151.
    2 P001        1 taxi,bus   O3     D4          2019-11-05 11:00:00,2019-11-05 11:40:00 -33.8  151. -33.9  151.
    3 P001        2 train      O5     D5          2019-11-06 08:00:00                     -33.9  151. -33.9  151.
    4 P002        0 taxi       O6     D6          2019-11-06 09:10:00                     -33.9  151. -33.9  151.
    5 P002        1 bus        O7     D7          2019-11-07 08:00:00                     -33.9  151. -33.9  151.
    6 P002        2 train,taxi O8     D9          2019-11-08 08:00:00,2019-11-08 08:50:00 -33.9  151. -33.9  151.
    

    I suggest you also include arrival time in your data and instead calculate the difference between departure and the previous arrival.

    Edit: Missed a cumsum(). Now fixed. Also, don't need rleid anymore.