Tags: r, dplyr, sliding-window, split-apply-combine

Correct use of dplyr functions to compute per-product sales in a sliding window, without needing an extra pass or join?


Given a data frame defined by:

set.seed(1)
date <- sample(seq(as.Date('2016/01/01'), as.Date('2016/12/31'), by="day"), 12)
vals <- data.frame(x = rep(1:3, 4), date = date, cost = rnorm(12, 100))
vals
#    x       date      cost
# 1  1 2016-04-07 100.48743
# 2  2 2016-05-15 100.73832
# 3  3 2016-07-27 100.57578
# 4  1 2016-11-25  99.69461
# 5  2 2016-03-14 101.51178
# 6  3 2016-11-20 100.38984
# 7  1 2016-12-06  99.37876
# 8  2 2016-08-25  97.78530
# 9  3 2016-08-13 101.12493
# 10 1 2016-01-23  99.95507
# 11 2 2016-12-27  99.98381
# 12 3 2016-03-03 100.94384

I want to add a new column where the new value for the ith row is the sum of all the cost values for which:

  • the date is less than or equal to the ith date and greater than the ith date minus 90 days
  • and the x value equals the x value of row i. (In this example the combinations of x and date are unique, but in general they may not be.)

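For example, for row 1 (x = 1, date = 2016-04-07) the window is (2016-01-08, 2016-04-07], so only rows 1 and 10 qualify and the new value would be 100.48743 + 99.95507 ≈ 200.44.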
I can do this in two different ways:

tmp <- vals %>% group_by(date, x) %>% 
summarise(total = sum(vals$cost[vals$date <= date[1] & vals$date > (date[1] - 90) & vals$x == x[1]]))
vals %>% left_join(tmp)

and

vals %>% rowwise() %>% 
mutate(total = sum(vals$cost[vals$date <= date[1] & vals$date > (date[1] - 90) & vals$x == x]))

Both are pretty slow on my larger data, presumably because of all the subsetting. And passing the whole data frame back into the calculation feels like a bit of a hack to me.

Is there a way to do this "properly" within dplyr? By which I mean, without having to pass in the data frame and do slow subsetting.

Or if not, is there at least a more efficient way to do this?


Solution

  • Basically, when the rows are ordered by date, you always calculate sum(cost[index_start:index_end]), where index_start and index_end slide over the rows. This can be done more efficiently using the cumulative sum of the cost: sum(cost[index_start:index_end]) = cumsum(cost)[index_end] - cumsum(cost)[index_start - 1]. For your data frame, one possible implementation is the following (a quick numeric check of the identity follows the code).

    # needs dplyr for the verbs, purrr for map_dbl() and lubridate for days()
    library(dplyr)
    library(purrr)
    library(lubridate)

    # arrange by x and date so that, within each x, all relevant costs are consecutive
    vals <- arrange(vals, x, date)
    group_by(vals, x) %>% 
      mutate(
        # running total of cost within each x group
        cumsum_cost = cumsum(cost),
        # index of the first row that falls inside each row's 90-day window
        index_start = map_dbl(
          date,
          function(cur_date, date) {
            # exclusive lower bound, as in the question: date > cur_date - 90 days
            min(which(cur_date - days(90) < date))
          },
          date = date),
        # cumulative cost just before the window starts (0 if the window starts at row 1)
        cumsum_cost_90_days_ago = map_dbl(
          index_start,
          function(index_start, cumsum_cost) {
            if (index_start - 1 <= 0) {
              return(0)
            } else {
              cumsum_cost[index_start - 1]
            }
          },
          cumsum_cost = cumsum_cost),
        # windowed sum = running total minus the running total just before the window
        cost_90_days = cumsum_cost - cumsum_cost_90_days_ago
      )
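
    To see the cumulative-sum identity used above on a small toy vector (purely a sanity check with made-up numbers, not code from the original answer):

    cost <- c(2, 5, 1, 7)
    sum(cost[2:4])                        # 13
    cumsum(cost)[4] - cumsum(cost)[1]     # 15 - 2 = 13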
    

    One could speed this up further by being smarter about finding index_start (for example, by exploiting the fact that the data frame is already ordered by date). One easy way to get the indices would be a rolling join, e.g. in data.table.
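
    As a concrete sketch of that last idea, one could use a non-equi self-join in data.table (closely related to a rolling join). The code below assumes vals as defined in the question; it is only one possible way to express the window, not code from the original answer:

    library(data.table)

    dt    <- as.data.table(vals)
    costs <- as.data.table(vals)            # the rows whose cost gets summed
    dt[, window_start := date - 90]         # exclusive lower bound of each row's window

    # For every row of dt, find all rows of 'costs' with the same x and a date in
    # (date - 90, date], then sum their cost. by = .EACHI evaluates the sum once
    # per row of dt, in row order, so $V1 lines up with dt's rows. (Every row
    # matches at least itself here, so there are no empty groups.)
    window_sums <- costs[dt,
                         on = .(x, date <= date, date > window_start),
                         sum(x.cost),       # x.cost is the cost column of 'costs'
                         by = .EACHI]

    dt[, total := window_sums$V1]
    dt[, window_start := NULL]

    This avoids the per-row subsetting of the full data frame: the join locates the matching rows through data.table's ordered join machinery instead of scanning every row for every row.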