Search code examples
sqlrdplyrwindow-functions

Emulating `slider`'s `slide_index_*()` functions in `SQL` for Non-Consecutively-Ordered Sliding Window Frames


Consider the following table:

library(dplyr, warn.conflicts = FALSE)

apple_tbl <-
    tribble(
        ~ "VENDOR", ~ "WEEK_IN_YEAR", ~ "APPLES_SOLD" 
        , "A", 10, 20
        , "A", 15, 10
        , "A", 17, 5
        , "B", 11, 25
        , "B", 12, 18
        , "B", 18, 30
        , "C", 10, 5
    )

apple_tbl
#> # A tibble: 7 × 3
#>   VENDOR WEEK_IN_YEAR APPLES_SOLD
#>   <chr>         <dbl>       <dbl>
#> 1 A                10          20
#> 2 A                15          10
#> 3 A                17           5
#> 4 B                11          25
#> 5 B                12          18
#> 6 B                18          30
#> 7 C                10           5

Created on 2023-02-21 with reprex v2.0.2

Notice how the weeks for each vendor, while ordered, are not in consecutive order; for Vendor A, we do not have weeks 11 through 14, nor 16, for example.

So trying to do a sum of the past-4-weeks'-worth of apples sold by each vendor is not really possible in SQL, as far as I know, but slider makes this very easy to calculate:

library(dplyr, warn.conflicts = F)
library(slider)

apple_tbl <-
    tribble(
        ~ "VENDOR", ~ "WEEK_IN_YEAR", ~ "APPLES_SOLD" 
        , "A", 10, 20
        , "A", 15, 10
        , "A", 17, 5
        , "B", 11, 25
        , "B", 12, 18
        , "B", 18, 30
        , "C", 10, 5
    )

apple_tbl %>%
    group_by(VENDOR) %>%
    mutate(
        APPLES_SOLD_PAST_4_WEEKS = 
            slide_index_sum(
                x = APPLES_SOLD # The vector being calculated/summed
                , i = WEEK_IN_YEAR # The vector that is being referenced/tracked by `slider`
                , before = 4 # From 4 weeks prior
                , after = 0 # To the current row's week
            )
    ) %>%
    ungroup()
#> # A tibble: 7 × 4
#>   VENDOR WEEK_IN_YEAR APPLES_SOLD APPLES_SOLD_PAST_4_WEEKS
#>   <chr>         <dbl>       <dbl>                    <dbl>
#> 1 A                10          20                       20
#> 2 A                15          10                       10
#> 3 A                17           5                       15
#> 4 B                11          25                       25
#> 5 B                12          18                       43
#> 6 B                18          30                       30
#> 7 C                10           5                        5

Created on 2023-02-21 with reprex v2.0.2

As slider, over an alternative like RcppRoll, can handle missing values in the "index" column (WEEK_IN_YEAR for this example).

The closest I can think of (using Snowflake as my reference point) would be this query:

SELECT
   VENDOR
   , WEEK_IN_YEAR
   , APPLES_SOLD
   , SUM(APPLES_SOLD) OVER (PARTITION BY VENDOR ORDER BY WEEK_IN_YEAR ASC ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) APPLES_SOLD_PAST_4_ROWS -- Not the same thing
FROM apple_tbl

enter image description here

But this falls apart immediately as the "reference" column (WEEK_IN_YEAR in this case) is not in consecutive order for each vendor.

The only other thing I can really think of is to have some sort of DIM_DATE table containing all weeks in the year, and doing a cross/left-join of the main FCT apple-sales-history table, and fill the missing weeks in (with a 0 for example). However, at scale this seems inelegant, or wasteful in terms of compute/time.

Any suggestion on how to do this in plain SQL would be appreciated.


Solution

  • Use a self-join on the condition:

    library(sqldf)
    
    sqldf("select a.*, sum(b.APPLES_SOLD) as APPLES_SOLD_PAST_4_WEEKS
      from apple_tbl a
      left join apple_tbl b
        on b.WEEK_IN_YEAR between a.WEEK_IN_YEAR - 4 and a.WEEK_IN_YEAR
          and a.VENDOR = b.VENDOR
      group by a.rowid
      order by a.rowid")
    

    giving:

      VENDOR WEEK_IN_YEAR APPLES_SOLD APPLES_SOLD_PAST_4_WEEKS
    1      A           10          20                       20
    2      A           15          10                       10
    3      A           17           5                       15
    4      B           11          25                       25
    5      B           12          18                       43
    6      B           18          30                       30
    7      C           10           5                        5
    

    Note that rollapplyr in zoo can also handle this. See the examples section on the help page.