sql google-bigquery analytic-functions rolling-average

Windowed Average, accounting for gaps

I need to calculate an average over the preceding 4 weeks...

SELECT
  *,
  AVG(val) OVER (PARTITION BY some_identifier, day_of_week_column
                     ORDER BY date_column
                  ROW BETWEEN 4 PRECEDING AND 1 PRECEDING
                )
                   AS preceding_4_week_average
FROM
  myTable

The data, however, is "sparse"

In this case my window function should look back "4 weeks" rather than "4 rows".
- A missing date is not a 0 it's implicitly a NULL

 thing | date       | dow | val | avg
   1   | 2018-01-01 |  1  |  1  | NULL  <= AVG({})
   1   | 2018-01-08 |  1  |  2  | 1     <= AVG({1})
   1   | 2018-01-15 |  1  |  3  | 1.5   <= AVG({1,2})
   1   | 2018-01-22 |  1  |  4  | 2     <= AVG({1,2,3})
   1   | 2018-01-29 |  1  |  5  | 2.5   <= AVG({1,2,3,4})

   1   | 2018-02-12 |  1  |  7  | 4     <= AVG({3,4,5})
   1   | 2018-02-19 |  1  |  8  | 5.33  <= AVG({4,5,7})
   1   | 2018-02-26 |  1  |  9  | 6.66  <= AVG({5,7,8})
   1   | 2018-03-05 |  1  |  10 | 8     <= AVG({7,8,9})
   1   | 2018-03-12 |  1  |  11 | 11.25 <= AVG({7,8,9,10})
   1   | 2018-03-19 |  1  |  12 | 9.5   <= AVG({8,9,10,11})

Note : There is no value for 2018-02-05

I would normally approach it in one of two ways...

LEFT JOIN on to a template to "force" all the dates to exist, and rely on AVG() effectively "ignoring" NULLs.

This is less than ideal as the number of "things" is huge and constructing this template is expensive.

SELECT
  *,
  AVG(mytable.val) OVER (PARTITION BY things.id, dates.dow
                             ORDER BY dates.date
                          ROW BETWEEN 4 PRECEDING AND 1 PRECEDING
                        )
                           AS preceding_4_week_average
FROM
  things
CROSS JOIN
  dates
LEFT JOIN
  myTable
    ON  myTable.date = dates.date
    AND myTable.id   = things.id

Don't use window functions, use a self join instead

This is less than ideal as there are hundreds of columns in myTable and BigQuery isn't performing very well on that.

SELECT
  myTable.*,
  AVG(hist.val)   AS preceding_4_week_average
FROM
  myTable
LEFT JOIN
  myTable   AS hist
    ON  hist.id    = myTable.id
    AND hist.date >= myTable.date - INTERVAL 28 DAYS
    AND hist.date <  myTable.date
GROUP BY
  myTable.column1,
  myTable.column2,
  etc, etc

The actual question

Does anyone else have an alternative, preferably using windowed/analytic functions to "look back 4 weeks" rather than "look back 4 rows"?

Solution

Below is for BigQuery Standard SQL

As you will see - the trick is in using RANGE instead of ROW

#standardSQL
SELECT *,
  AVG(val) OVER(
    PARTITION BY id, dow 
    ORDER BY DATE_DIFF(DATE_TRUNC(date, WEEK), DATE_TRUNC(CURRENT_DATE(), WEEK), WEEK) 
    RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING
  ) AVG
FROM `project.dataset.table`

You can test, play with above using dummy data from your question as below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, DATE '2018-01-01' date, 1 dow, 1 val UNION ALL
  SELECT 1, '2018-01-08', 1, 2  UNION ALL
  SELECT 1, '2018-01-15', 1, 3  UNION ALL
  SELECT 1, '2018-01-22', 1, 4  UNION ALL
  SELECT 1, '2018-01-29', 1, 5  UNION ALL
  SELECT 1, '2018-02-12', 1, 7  UNION ALL
  SELECT 1, '2018-02-19', 1, 8  UNION ALL
  SELECT 1, '2018-02-26', 1, 9  UNION ALL
  SELECT 1, '2018-03-05', 1, 10 UNION ALL
  SELECT 1, '2018-03-12', 1, 11 UNION ALL
  SELECT 1, '2018-03-19', 1, 12  
)
SELECT *,
  AVG(val) OVER(
    PARTITION BY id, dow 
    ORDER BY DATE_DIFF(DATE_TRUNC(date, WEEK), DATE_TRUNC(CURRENT_DATE(), WEEK), WEEK) 
    RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING
  ) avg
FROM `project.dataset.table`
-- ORDER BY date

with result as

Row id      date    dow val avg 
1   1   2018-01-01  1   1   null     
2   1   2018-01-08  1   2   1.0  
3   1   2018-01-15  1   3   1.5  
4   1   2018-01-22  1   4   2.0  
5   1   2018-01-29  1   5   2.5  
6   1   2018-02-12  1   7   4.0  
7   1   2018-02-19  1   8   5.333333333333333    
8   1   2018-02-26  1   9   6.666666666666667    
9   1   2018-03-05  1   10  8.0  
10  1   2018-03-12  1   11  8.5  
11  1   2018-03-19  1   12  9.5