I need to calculate an average over the preceding 4 weeks...
SELECT
*,
AVG(val) OVER (PARTITION BY some_identifier, day_of_week_column
ORDER BY date_column
ROW BETWEEN 4 PRECEDING AND 1 PRECEDING
)
AS preceding_4_week_average
FROM
myTable
The data, however, is "sparse"
In this case my window function should look back "4 weeks" rather than "4 rows".
- A missing date is not a 0
it's implicitly a NULL
thing | date | dow | val | avg
1 | 2018-01-01 | 1 | 1 | NULL <= AVG({})
1 | 2018-01-08 | 1 | 2 | 1 <= AVG({1})
1 | 2018-01-15 | 1 | 3 | 1.5 <= AVG({1,2})
1 | 2018-01-22 | 1 | 4 | 2 <= AVG({1,2,3})
1 | 2018-01-29 | 1 | 5 | 2.5 <= AVG({1,2,3,4})
1 | 2018-02-12 | 1 | 7 | 4 <= AVG({3,4,5})
1 | 2018-02-19 | 1 | 8 | 5.33 <= AVG({4,5,7})
1 | 2018-02-26 | 1 | 9 | 6.66 <= AVG({5,7,8})
1 | 2018-03-05 | 1 | 10 | 8 <= AVG({7,8,9})
1 | 2018-03-12 | 1 | 11 | 11.25 <= AVG({7,8,9,10})
1 | 2018-03-19 | 1 | 12 | 9.5 <= AVG({8,9,10,11})
Note : There is no value for 2018-02-05
I would normally approach it in one of two ways...
AVG()
effectively "ignoring" NULLs.This is less than ideal as the number of "things" is huge and constructing this template is expensive.
SELECT
*,
AVG(mytable.val) OVER (PARTITION BY things.id, dates.dow
ORDER BY dates.date
ROW BETWEEN 4 PRECEDING AND 1 PRECEDING
)
AS preceding_4_week_average
FROM
things
CROSS JOIN
dates
LEFT JOIN
myTable
ON myTable.date = dates.date
AND myTable.id = things.id
This is less than ideal as there are hundreds of columns in myTable and BigQuery isn't performing very well on that.
SELECT
myTable.*,
AVG(hist.val) AS preceding_4_week_average
FROM
myTable
LEFT JOIN
myTable AS hist
ON hist.id = myTable.id
AND hist.date >= myTable.date - INTERVAL 28 DAYS
AND hist.date < myTable.date
GROUP BY
myTable.column1,
myTable.column2,
etc, etc
The actual question
Does anyone else have an alternative, preferably using windowed/analytic functions to "look back 4 weeks" rather than "look back 4 rows"?
Below is for BigQuery Standard SQL
As you will see - the trick is in using RANGE
instead of ROW
#standardSQL
SELECT *,
AVG(val) OVER(
PARTITION BY id, dow
ORDER BY DATE_DIFF(DATE_TRUNC(date, WEEK), DATE_TRUNC(CURRENT_DATE(), WEEK), WEEK)
RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING
) AVG
FROM `project.dataset.table`
You can test, play with above using dummy data from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, DATE '2018-01-01' date, 1 dow, 1 val UNION ALL
SELECT 1, '2018-01-08', 1, 2 UNION ALL
SELECT 1, '2018-01-15', 1, 3 UNION ALL
SELECT 1, '2018-01-22', 1, 4 UNION ALL
SELECT 1, '2018-01-29', 1, 5 UNION ALL
SELECT 1, '2018-02-12', 1, 7 UNION ALL
SELECT 1, '2018-02-19', 1, 8 UNION ALL
SELECT 1, '2018-02-26', 1, 9 UNION ALL
SELECT 1, '2018-03-05', 1, 10 UNION ALL
SELECT 1, '2018-03-12', 1, 11 UNION ALL
SELECT 1, '2018-03-19', 1, 12
)
SELECT *,
AVG(val) OVER(
PARTITION BY id, dow
ORDER BY DATE_DIFF(DATE_TRUNC(date, WEEK), DATE_TRUNC(CURRENT_DATE(), WEEK), WEEK)
RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING
) avg
FROM `project.dataset.table`
-- ORDER BY date
with result as
Row id date dow val avg
1 1 2018-01-01 1 1 null
2 1 2018-01-08 1 2 1.0
3 1 2018-01-15 1 3 1.5
4 1 2018-01-22 1 4 2.0
5 1 2018-01-29 1 5 2.5
6 1 2018-02-12 1 7 4.0
7 1 2018-02-19 1 8 5.333333333333333
8 1 2018-02-26 1 9 6.666666666666667
9 1 2018-03-05 1 10 8.0
10 1 2018-03-12 1 11 8.5
11 1 2018-03-19 1 12 9.5