Search code examples
rdataframedplyrpurrr

Accelerating some kind of dataframe stretching in R


I have the following travel booking data:

library(dplyr)
bookings <- data.frame(Route = 'AB', DepartureMonth = rep(yearmonth("2013-11"),9),
           EffectiveFrom = c(rep(yearmonth("2013-07"),5), c(rep(yearmonth("2013-08"),4))),
           EffectiveTo = c("2013-08", "2013-09", "2013-10", "2013-11", "2199-12",
                           "2013-09", "2013-10", "2013-11", "2199-12"),
           ConfirmedBooking = c(16, 6, 8, 15, 15, 76, 95, 81, 202)) %>%
  mutate(EffectiveTo = yearmonth(EffectiveTo))
bookings 
Route DepartureMonth EffectiveFrom    EffectiveTo ConfirmedBooking
AB      2013 nov.       2013 july       2013 aug.       16
AB      2013 nov.       2013 july       2013 sept.      6
AB      2013 nov.       2013 july       2013 oct.       8
AB      2013 nov.       2013 july       2013 nov.       15
AB      2013 nov.       2013 july       2199 dec.       15
AB      2013 nov.       2013 aug.       2013 sept.      76
AB      2013 nov.       2013 aug.       2013 oct.       95
AB      2013 nov.       2013 aug.       2013 nov.       81
AB      2013 nov.       2013 aug.       2199 dec.       202

I wrote a code to get the number of bookings at the end of a given month prior to departure:

booking_month_decomposition <- function(Route, DepartureMonth, EffectiveFrom, EffectiveTo, ConfirmedBooking) {
  end_month = if_else(EffectiveTo < DepartureMonth, EffectiveTo, DepartureMonth)
  maxiter = end_month - EffectiveFrom
  
  return(map_dfr(1:maxiter, function(x) data.frame(Route = Route, DepartureMonth=DepartureMonth,
                                                   BookingMonth=EffectiveFrom + x -1,
                                                   ConfirmedBooking=ConfirmedBooking)))
}

bookings %>% pmap_dfr(booking_month_decomposition) %>%
  group_by(Route, DepartureMonth, BookingMonth) %>%
  summarise(ConfirmedBooking = sum(ConfirmedBooking)) %>%
  ungroup()
Route DepartureMonth BookingMonth   ConfirmedBooking
AB      2013 nov.       2013 july       60  
AB      2013 nov.       2013 aug.       498 
AB      2013 nov.       2013 sept.      416 
AB      2013 nov.       2013 oct.       313 

The problem is that the execution of the code takes almost an hour when I start from a dataframe of 170.000 rows to arrive at a dataframe of 34.000 rows.

The booking_month_decomposition function takes too long to run. Am I using the purrr functions wrong?

PS: The number of bookings was supposed to increase when we get closer to departure, but this cannot be seen because to simplify I shortened the departure dataframe by eliminating bookings starting from September.


Solution

  • Looping is really not efficient in R so it's better to find a way to avoid it when possible.

    For your use case, you can create the column BookingMonth containing a vector containing the seq of months between EffectiveFrom and end_month for each rows then unnest that column to duplicate the rows for each month.

    You can then filter for when EffectiveTo is equal to the BookingMonth since EffectiveTo is not inclusive then use group_by to sum the variable ConfirmedBooking.

    library(dplyr)
    library(tidyr)
    library(tsibble)
    
    bookings = bookings %>%
      mutate(end_month = if_else(EffectiveTo < DepartureMonth, EffectiveTo, DepartureMonth)) %>%
      group_by(Route, EffectiveTo, ConfirmedBooking, DepartureMonth) %>%
      do(
        BookingMonth = seq(from = min(.$EffectiveFrom), to = max(.$end_month), by = 1)
      ) %>%
      unnest() %>%
      ungroup() %>%
      filter(EffectiveTo != BookingMonth) %>%
      group_by(Route, DepartureMonth, BookingMonth) %>%
      summarise(ConfirmedBooking = sum(ConfirmedBooking)) %>%
      filter(DepartureMonth != BookingMonth)