Search code examples
rtime-serieszoosliding-window

Is there a fast R function like rollapplyr with increasing window size?


I want to calculate a sum over a sliding window on grouped data.

As I would like to stick to official functions if possible I started with rollapplyr like this:

library(tidyverse)
library(reshape2)
library(zoo)  

data = data.frame(Count=seq(1,10,1),
                  group=c("A","B","A","A","B","B","B","B","A","A"))


window_size <- 3    

data_rolling <- data %>%
  arrange(group) %>%
  group_by(group) %>%
  mutate(Rolling_Count = rollapplyr(Count, width=window_size, FUN=sum, fill = NA)) %>%
  ungroup()

for the first entries which are smaller than width (in this case 3) it gets filled with NA as defined, but I would actually like to have the sum of the possible data there like this:

 Count group Rolling_Count expected_Result
 1     A            NA    1
 3     A            NA    4
 4     A            8     8
 9     A            16    16
10     A            23    23
 2     B            NA    2
 5     B            NA    7
 6     B            13    13
 7     B            18    18
 8     B            21    21

I know that I can replace the width=window_size with something like this:

c(rep(1:window_size,1),rep(window_size:window_size,(n()-window_size)))

to get what I want but this is really slow. In addition this approach would assume that n() is greater than window_size.

So: Is there already an R/zoo function which can handle grouped data like above and in addition data with less than window_size entries and is faster to the above approach?

Thanks for any hints!


Solution

  • A solution based on data.table and RcppRoll that should be much more performant.

    It's not as clean as I would like -- there's actually a partial argument in RcppRoll::roll_sum() that hasn't been implemented yet that would theoretically solve this cleanly, but it doesn't seem like that will be worked anytime soon-- see GH Issue #18 .

    At any rate, until someone implements a rolling sum in R that allows what you need here, adding in a cumsum on the first n - 1 rows seems to be a sensible solution.

    library(data.table)
    library(RcppRoll)
    
    data = data.frame(Count=seq(1,10,1),
                      group=c("A","B","A","A","B","B","B","B","A","A"))
    
    ## Convert to a `data.table` by reference
    setDT(data)
    window_size <- 3    
    
    ## Add a counter row so that we can go back and fill in rows
    ## 1 & 2 of each group
    data[,Group_RowNumber := seq_len(.N), keyby = .(group)]
    
    ## Do a rolling window -- this won't fill in the first 2 rows
    data[,Rolling_Count := RcppRoll::roll_sum(Count,
                                              n = window_size,
                                              align = "right",
                                              fill = NA), keyby = .(group)]
    
    ## Go back and fill in the ones we missed
    data[Group_RowNumber < window_size, Rolling_Count := cumsum(Count), by = .(group)]
    
    data
    
    #     Count group Group_RowNumber Rolling_Count
    #  1:     1     A               1             1
    #  2:     3     A               2             4
    #  3:     4     A               3             8
    #  4:     9     A               4            16
    #  5:    10     A               5            23
    #  6:     2     B               1             2
    #  7:     5     B               2             7
    #  8:     6     B               3            13
    #  9:     7     B               4            18
    # 10:     8     B               5            21