Search code examples
rperformancedplyrdata.tablezoo

Is there a faster alternative to the date extension function using the zoo package?


The code posted at the bottom does a clean job of extending dates expressed both (a) as elapsed number of months per Period_1 in the code and (b) in YYYY-MM format defined as Period_2. Item (b) uses the zoo package as.yearmon() function for extending the YYYY-MM date.

However, when I run this code against the 2m+ row data file this is intended for, running Period_1 (item (a) above) is very fast (only 3.3 seconds) but running the code for Period_2 (item (b) above) is very slow (taking 1.74 minutes total). Is there a fast alternative to the date extensions for Period_2, for example using data.table package or just dplyr?

Code:

start_time <- Sys.time()

library(dplyr)
library(tidyr)
library(zoo)

testDF <-
  data.frame(
    ID = as.numeric(c(rep(1,5),rep(50,3),rep(60,3))),
    Period_1 = as.numeric(c(1:5,1:3,1:3)),
    Period_2 = c("2012-06","2012-07","2012-08","2012-09","2012-10","2013-06","2013-07","2013-08","2012-10","2012-11","2012-12"),
    Bal = as.numeric(c(rep(10,5),21:23,36:34)),
    State = c("XX","AA","BB","CC","XX","AA","BB","CC","SS","XX","AA")
  )

testPeriod_1 <-
  testDF %>%
  tidyr::complete(ID, nesting(Period_1)) %>%
  tidyr::fill(Bal, State, .direction = "down")

testPeriod_2 <- testPeriod_1 %>%
  group_by(ID) %>%
  mutate(Period_2 = as.yearmon(first(Period_2)) + seq(0, by=1/12, length=n())) %>%
  mutate(Period_2 = format(Period_2, "%Y-%m")) %>%
  ungroup

end_time <- Sys.time()
end_time - start_time

Solution

  • Avoid the group-by:

    library(hutilscpp)
    
    testPeriod_1 %>%
      mutate(Period_2 = as.yearmon(Period_2)) %>%
      mutate(Period_2 = cumsum_reset(is.na(Period_2), is.na(Period_2) / 12) + na.locf(Period_2)) %>%
      mutate(Period_2 = format(Period_2, "%Y-%m"))
    
    #Unit: milliseconds
    #                expr     min      lq      mean   median       uq     max neval
    # testPeriod_2_faster  6.0256  6.4595  7.147424  6.54710  6.74730 19.3891   100
    #        testPeriod_2 16.3351 16.8705 18.686059 17.27665 18.46105 31.5112   100
    

    This assumes the first year-month of a group is never NA and you only need to extrapolate.