Currently working with an interesting transport smart card dataset. Each line in the current data represent a trip (e.g. bus trip from A to B). Any trips within 60 min needs to be grouped into journey.
The current table:
CustomerID SegmentID OriginLocation Destination Start Time End Time Fare TypeOfTransport .. .. ..
A001 101 LocationA LocationB 7:30am 7:45am 1.5 Bus
A001 102 LocationB LocationC 7:50am 8:30am 3.5 Train
A001 103 LocationC LocationB 17:10pm 18:00pm 3.5 Train
A001 104 LocationB LocationA 18:10pm 18:30pm 1.5 Bus
A002 105 LocationK LocationY 11:30am 12:30pm 3 Train
A003 106 LocationP LocationO 10:23am 11:13am 4 Ferrie
and covert into sth like:
CustomerID JourneyID OriginLocation Destination Start Time End Time Fare TypeOfTransport NumOfTrips
A001 1 LocationA LocationC 7:30am 8:30am 5 Intermodal 2
A001 2 LocationC LocationA 17:10pm 18:30pm 5 Intermodal 2
A002 6 LocationK LocationY 11:30am 12:30pm 3 Train 1
A003 8 LocationP LocationO 10:23am 11:13am 4 Ferrie 1
I'm new to R and have no idea how to start, so any guidance would be appreciated.
Here's a starter for your example data. Things will probably get more complicated with your real data...
library(dplyr)
## prepare data
df<- read.table(header=T, stringsAsFactors = FALSE, text="
CustomerID SegmentID OriginLocation Destination StartTime EndTime Fare TypeOfTransport
A001 101 LocationA LocationB 7:30am 7:45am 1.5 Bus
A001 102 LocationB LocationC 7:50am 8:30am 3.5 Train
A001 103 LocationC LocationB 17:10pm 18:00pm 3.5 Train
A001 104 LocationB LocationA 18:10pm 18:30pm 1.5 Bus
A002 105 LocationK LocationY 11:30am 12:30pm 3 Train
A003 106 LocationP LocationO 10:23am 11:13am 4 Ferrie")
df$StartTime <- as.POSIXct(df$StartTime, format="%H:%M%p")
df$EndTime <- as.POSIXct(df$EndTime, format="%H:%M%p")
## Get group for each observation according to start-/endtimes + gap (60secs+60mins)
getGrp <- function(EndTimes, StartTimes, maxgap = 60*60) {
vbetween <- Vectorize(dplyr::between, c("left", "right"))
none <- Negate(any)
if (length(EndTimes) > 1) {
mat <- vbetween(EndTimes, StartTimes-maxgap, StartTimes)
idx <- apply(mat, 1, none, na.rm = T)
grp <- cumsum(c(TRUE, head(idx, -1)))
} else {
grp <- 1L
}
return(grp)
}
## Summarise
df %>%
group_by(CustomerID) %>%
mutate(grp = getGrp(EndTime, StartTime)) %>%
group_by(grp, add = TRUE) %>%
summarise(Fare = sum(Fare),
NumOfTrips = n(),
StartTime = format(min(StartTime), "%H:%M"),
EndTime = format(max(EndTime), "%H:%M"),
OriginLocation = OriginLocation[row_number(1)],
Destination = tail(Destination, 1),
TypeOfTransport = list(TypeOfTransport)) %>%
select(-grp)
## Result
# CustomerID Fare NumOfTrips StartTime EndTime OriginLocation Destination TypeOfTransport
# (chr) (dbl) (int) (chr) (chr) (chr) (chr) (chr)
# 1 A001 5 2 07:30 08:30 LocationA LocationC <chr[2]>
# 2 A001 5 2 17:10 18:30 LocationC LocationA <chr[2]>
# 3 A002 3 1 11:30 12:30 LocationK LocationY <chr[1]>
# 4 A003 4 1 10:23 11:13 LocationP LocationO <chr[1]>