Consider the following table:
library(dplyr, warn.conflicts = FALSE)
apple_tbl <-
tribble(
~ "VENDOR", ~ "WEEK_IN_YEAR", ~ "APPLES_SOLD"
, "A", 10, 20
, "A", 15, 10
, "A", 17, 5
, "B", 11, 25
, "B", 12, 18
, "B", 18, 30
, "C", 10, 5
)
apple_tbl
#> # A tibble: 7 × 3
#> VENDOR WEEK_IN_YEAR APPLES_SOLD
#> <chr> <dbl> <dbl>
#> 1 A 10 20
#> 2 A 15 10
#> 3 A 17 5
#> 4 B 11 25
#> 5 B 12 18
#> 6 B 18 30
#> 7 C 10 5
Created on 2023-02-21 with reprex v2.0.2
Notice how the weeks for each vendor, while ordered, are not in consecutive order; for Vendor A
, we do not have weeks 11 through 14, nor 16, for example.
So trying to do a sum of the past-4-weeks'-worth of apples sold by each vendor is not really possible in SQL
, as far as I know, but slider
makes this very easy to calculate:
library(dplyr, warn.conflicts = F)
library(slider)
apple_tbl <-
tribble(
~ "VENDOR", ~ "WEEK_IN_YEAR", ~ "APPLES_SOLD"
, "A", 10, 20
, "A", 15, 10
, "A", 17, 5
, "B", 11, 25
, "B", 12, 18
, "B", 18, 30
, "C", 10, 5
)
apple_tbl %>%
group_by(VENDOR) %>%
mutate(
APPLES_SOLD_PAST_4_WEEKS =
slide_index_sum(
x = APPLES_SOLD # The vector being calculated/summed
, i = WEEK_IN_YEAR # The vector that is being referenced/tracked by `slider`
, before = 4 # From 4 weeks prior
, after = 0 # To the current row's week
)
) %>%
ungroup()
#> # A tibble: 7 × 4
#> VENDOR WEEK_IN_YEAR APPLES_SOLD APPLES_SOLD_PAST_4_WEEKS
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 10 20 20
#> 2 A 15 10 10
#> 3 A 17 5 15
#> 4 B 11 25 25
#> 5 B 12 18 43
#> 6 B 18 30 30
#> 7 C 10 5 5
Created on 2023-02-21 with reprex v2.0.2
As slider
, over an alternative like RcppRoll
, can handle missing values in the "index" column (WEEK_IN_YEAR
for this example).
The closest I can think of (using Snowflake as my reference point) would be this query:
SELECT
VENDOR
, WEEK_IN_YEAR
, APPLES_SOLD
, SUM(APPLES_SOLD) OVER (PARTITION BY VENDOR ORDER BY WEEK_IN_YEAR ASC ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) APPLES_SOLD_PAST_4_ROWS -- Not the same thing
FROM apple_tbl
But this falls apart immediately as the "reference" column (WEEK_IN_YEAR
in this case) is not in consecutive order for each vendor.
The only other thing I can really think of is to have some sort of DIM_DATE
table containing all weeks in the year, and doing a cross/left-join of the main FCT
apple-sales-history table, and fill the missing weeks in (with a 0
for example). However, at scale this seems inelegant, or wasteful in terms of compute/time.
Any suggestion on how to do this in plain SQL
would be appreciated.
Use a self-join on the condition:
library(sqldf)
sqldf("select a.*, sum(b.APPLES_SOLD) as APPLES_SOLD_PAST_4_WEEKS
from apple_tbl a
left join apple_tbl b
on b.WEEK_IN_YEAR between a.WEEK_IN_YEAR - 4 and a.WEEK_IN_YEAR
and a.VENDOR = b.VENDOR
group by a.rowid
order by a.rowid")
giving:
VENDOR WEEK_IN_YEAR APPLES_SOLD APPLES_SOLD_PAST_4_WEEKS
1 A 10 20 20
2 A 15 10 10
3 A 17 5 15
4 B 11 25 25
5 B 12 18 43
6 B 18 30 30
7 C 10 5 5
Note that rollapplyr
in zoo can also handle this. See the examples section on the help page.