I have the following travel booking data:
library(dplyr)
bookings <- data.frame(Route = 'AB', DepartureMonth = rep(yearmonth("2013-11"),9),
EffectiveFrom = c(rep(yearmonth("2013-07"),5), c(rep(yearmonth("2013-08"),4))),
EffectiveTo = c("2013-08", "2013-09", "2013-10", "2013-11", "2199-12",
"2013-09", "2013-10", "2013-11", "2199-12"),
ConfirmedBooking = c(16, 6, 8, 15, 15, 76, 95, 81, 202)) %>%
mutate(EffectiveTo = yearmonth(EffectiveTo))
bookings
Route DepartureMonth EffectiveFrom EffectiveTo ConfirmedBooking
AB 2013 nov. 2013 july 2013 aug. 16
AB 2013 nov. 2013 july 2013 sept. 6
AB 2013 nov. 2013 july 2013 oct. 8
AB 2013 nov. 2013 july 2013 nov. 15
AB 2013 nov. 2013 july 2199 dec. 15
AB 2013 nov. 2013 aug. 2013 sept. 76
AB 2013 nov. 2013 aug. 2013 oct. 95
AB 2013 nov. 2013 aug. 2013 nov. 81
AB 2013 nov. 2013 aug. 2199 dec. 202
I wrote a code to get the number of bookings at the end of a given month prior to departure:
booking_month_decomposition <- function(Route, DepartureMonth, EffectiveFrom, EffectiveTo, ConfirmedBooking) {
end_month = if_else(EffectiveTo < DepartureMonth, EffectiveTo, DepartureMonth)
maxiter = end_month - EffectiveFrom
return(map_dfr(1:maxiter, function(x) data.frame(Route = Route, DepartureMonth=DepartureMonth,
BookingMonth=EffectiveFrom + x -1,
ConfirmedBooking=ConfirmedBooking)))
}
bookings %>% pmap_dfr(booking_month_decomposition) %>%
group_by(Route, DepartureMonth, BookingMonth) %>%
summarise(ConfirmedBooking = sum(ConfirmedBooking)) %>%
ungroup()
Route DepartureMonth BookingMonth ConfirmedBooking
AB 2013 nov. 2013 july 60
AB 2013 nov. 2013 aug. 498
AB 2013 nov. 2013 sept. 416
AB 2013 nov. 2013 oct. 313
The problem is that the execution of the code takes almost an hour when I start from a dataframe of 170.000 rows to arrive at a dataframe of 34.000 rows.
The booking_month_decomposition function takes too long to run. Am I using the purrr functions wrong?
PS: The number of bookings was supposed to increase when we get closer to departure, but this cannot be seen because to simplify I shortened the departure dataframe by eliminating bookings starting from September.
Looping is really not efficient in R so it's better to find a way to avoid it when possible.
For your use case, you can create the column BookingMonth
containing a vector containing the seq
of months between EffectiveFrom
and end_month
for each rows then unnest that column to duplicate the rows for each month.
You can then filter for when EffectiveTo
is equal to the BookingMonth
since EffectiveTo
is not inclusive then use group_by
to sum
the variable ConfirmedBooking
.
library(dplyr)
library(tidyr)
library(tsibble)
bookings = bookings %>%
mutate(end_month = if_else(EffectiveTo < DepartureMonth, EffectiveTo, DepartureMonth)) %>%
group_by(Route, EffectiveTo, ConfirmedBooking, DepartureMonth) %>%
do(
BookingMonth = seq(from = min(.$EffectiveFrom), to = max(.$end_month), by = 1)
) %>%
unnest() %>%
ungroup() %>%
filter(EffectiveTo != BookingMonth) %>%
group_by(Route, DepartureMonth, BookingMonth) %>%
summarise(ConfirmedBooking = sum(ConfirmedBooking)) %>%
filter(DepartureMonth != BookingMonth)