Search code examples
rtransport

R: how to combine trip segments into a journey with Transport smart card data


Currently working with an interesting transport smart card dataset. Each line in the current data represent a trip (e.g. bus trip from A to B). Any trips within 60 min needs to be grouped into journey.

The current table:

CustomerID  SegmentID   OriginLocation  Destination Start Time  End Time    Fare    TypeOfTransport ..  ..  ..
A001        101         LocationA   LocationB       7:30am      7:45am      1.5        Bus          
A001        102         LocationB   LocationC       7:50am      8:30am      3.5        Train            
A001        103         LocationC   LocationB       17:10pm     18:00pm     3.5        Train            
A001        104         LocationB   LocationA       18:10pm     18:30pm     1.5        Bus          
A002        105         LocationK   LocationY       11:30am     12:30pm     3          Train            
A003        106         LocationP   LocationO       10:23am     11:13am     4          Ferrie           

and covert into sth like:

CustomerID  JourneyID   OriginLocation  Destination Start Time  End Time    Fare    TypeOfTransport    NumOfTrips
A001        1           LocationA       LocationC   7:30am      8:30am      5       Intermodal        2
A001        2           LocationC       LocationA   17:10pm     18:30pm     5       Intermodal        2
A002        6           LocationK       LocationY   11:30am     12:30pm     3       Train             1
A003        8           LocationP       LocationO   10:23am     11:13am     4       Ferrie            1

I'm new to R and have no idea how to start, so any guidance would be appreciated.


Solution

  • Here's a starter for your example data. Things will probably get more complicated with your real data...

    library(dplyr)
    
    ## prepare data
    df<- read.table(header=T, stringsAsFactors = FALSE, text="
    CustomerID  SegmentID   OriginLocation  Destination StartTime  EndTime    Fare    TypeOfTransport
    A001        101         LocationA   LocationB       7:30am      7:45am      1.5        Bus
    A001        102         LocationB   LocationC       7:50am      8:30am      3.5        Train
    A001        103         LocationC   LocationB       17:10pm     18:00pm     3.5        Train
    A001        104         LocationB   LocationA       18:10pm     18:30pm     1.5        Bus
    A002        105         LocationK   LocationY       11:30am     12:30pm     3          Train
    A003        106         LocationP   LocationO       10:23am     11:13am     4          Ferrie")
    df$StartTime <- as.POSIXct(df$StartTime, format="%H:%M%p")
    df$EndTime <- as.POSIXct(df$EndTime, format="%H:%M%p")
    
    
    ## Get group for each observation according to start-/endtimes + gap (60secs+60mins)
    getGrp <- function(EndTimes, StartTimes, maxgap = 60*60) {
      vbetween <- Vectorize(dplyr::between, c("left", "right"))
      none <- Negate(any)
      if (length(EndTimes) > 1) {
        mat <-   vbetween(EndTimes, StartTimes-maxgap, StartTimes)
        idx <- apply(mat, 1, none, na.rm = T)
        grp <- cumsum(c(TRUE, head(idx, -1)))
      } else {
        grp <- 1L
      }
      return(grp)
    }
    
    ## Summarise
    df %>% 
      group_by(CustomerID) %>% 
      mutate(grp = getGrp(EndTime, StartTime)) %>%
      group_by(grp, add = TRUE) %>%
      summarise(Fare = sum(Fare), 
                NumOfTrips = n(), 
                StartTime = format(min(StartTime), "%H:%M"), 
                EndTime = format(max(EndTime), "%H:%M"), 
                OriginLocation = OriginLocation[row_number(1)],
                Destination = tail(Destination, 1),
                TypeOfTransport = list(TypeOfTransport)) %>%
      select(-grp) 
    
    ## Result
    # CustomerID  Fare NumOfTrips StartTime EndTime OriginLocation Destination TypeOfTransport
    # (chr) (dbl)      (int)     (chr)   (chr)          (chr)       (chr)           (chr)
    # 1       A001     5          2     07:30   08:30      LocationA   LocationC        <chr[2]>
    # 2       A001     5          2     17:10   18:30      LocationC   LocationA        <chr[2]>
    # 3       A002     3          1     11:30   12:30      LocationK   LocationY        <chr[1]>
    # 4       A003     4          1     10:23   11:13      LocationP   LocationO        <chr[1]>