Search code examples
rtidyversedbplyr

Is there a dbplyr method to calculate rolling sum based on a date range


I'm using dbplyr to access data in Redshift tables and Tidyverse to do the data wrangling. I'm trying to calculate a rolling sum amount over 7 days.

Unfortunately, none of the packages that I've found suggested on stack overflow, such as 'slider' and 'runner', seem to be compatible with dbplyr (or with the Redshift tables and sql that I'm using).

I'm trying to achieve similar results to this table. In this example, the value in the cum_sum_7Days column is the sum of values in the amount column where the start_date fits between the start_date and previous_7Day columns.

Acct          Start_Date           Previous_7Day           Amount       Cum_sum_7Days

YYYY       8/07/2022 7:04      1/07/2022 7:04               500            500
YYYY       8/07/2022 12:49     1/07/2022 12:49              200            700
YYYY       9/07/2022 11:47     2/07/2022 11:47              300            1000
YYYY       9/07/2022 11:52     2/07/2022 11:52              45.6           1045.6
YYYY       12/07/2022 13:03    5/07/2022 13:03              200            1245.6
YYYY       15/07/2022 13:53    8/07/2022 13:53              200            745.6
YYYY       16/07/2022 12:58    9/07/2022 12:58              300            700
YYYY       16/07/2022 13:28    9/07/2022 13:28              500            1200
YYYY       19/07/2022 12:22    12/07/2022 12:22             200            1400
YYYY       23/07/2022 5:52     16/07/2022 5:52              200            1200
YYYY       26/07/2022 13:01    19/07/2022 13:01             100            300
YYYY       29/07/2022 13:50    22/07/2022 13:50             200            500
YYYY       30/07/2022 13:57    23/07/2022 13:57             300            600
YYYY       3/08/2022 6:17      27/07/2022 6:17              200            700
YYYY       5/08/2022 13:30     29/07/2022 13:30             200            900
YYYY       9/08/2022 13:44     2/08/2022 13:44              200            600
YYYY       12/08/2022 12:13    5/08/2022 12:13              200            600

Note that:

  • My dates are not consecutive
  • The date-time fields are required as the 7 days must be accurate to the hour-minute
  • The rolling window size may be changed from 7 days (14 days, 1 year, etc.)

So, any solution need to handle this.


Solution

  • Yes, this should be possible with dbplyr. But the method will look different that via pure R.

    dbplyr works by translating tidyverse commands into SQL. If no translation is defined for a function then it is unlikely to work correctly. This is why using functions like sum_run from the runner package give a no applicable method error. dbplyr does not know how to translate them into SQL.

    The method we can use follows similar logic to how we would approach this problem in SQL:

    output = tbl1 %>%
      inner_join(tbl1, by = "Acct", suffix = c(".x", ".y")) %>%
      filter(
        Previous_7Day.x <= Start_Date.y,
        Start_Date.y <= Start_Date.x
      ) %>%
      group_by(Acct.x, Start_Date.x, Previous_7Day.x, Amount.x) %>%
      summarise(Cum_sum_7Days = sum(Amount.y) %>%
      select(
        Acct = Acct.x,
        Start_Date = Start_Date.x,
        Previous_7Day = Previous_7Day.x,
        Amount = Amount.x,
        Cum_sum_7Days
      )
    

    The core of the idea is to join the table to itself. After filtering we have all combinations where the .y date is between the .x dates. So we can sum all the .y amounts to produce the rolling sum.