Search code examples
sqlgoogle-bigqueryanalytic-functionsrolling-average

Windowed Average, accounting for gaps


I need to calculate an average over the preceding 4 weeks...

SELECT
  *,
  AVG(val) OVER (PARTITION BY some_identifier, day_of_week_column
                     ORDER BY date_column
                  ROW BETWEEN 4 PRECEDING AND 1 PRECEDING
                )
                   AS preceding_4_week_average
FROM
  myTable

The data, however, is "sparse"

In this case my window function should look back "4 weeks" rather than "4 rows".
- A missing date is not a 0 it's implicitly a NULL

 thing | date       | dow | val | avg
   1   | 2018-01-01 |  1  |  1  | NULL  <= AVG({})
   1   | 2018-01-08 |  1  |  2  | 1     <= AVG({1})
   1   | 2018-01-15 |  1  |  3  | 1.5   <= AVG({1,2})
   1   | 2018-01-22 |  1  |  4  | 2     <= AVG({1,2,3})
   1   | 2018-01-29 |  1  |  5  | 2.5   <= AVG({1,2,3,4})

   1   | 2018-02-12 |  1  |  7  | 4     <= AVG({3,4,5})
   1   | 2018-02-19 |  1  |  8  | 5.33  <= AVG({4,5,7})
   1   | 2018-02-26 |  1  |  9  | 6.66  <= AVG({5,7,8})
   1   | 2018-03-05 |  1  |  10 | 8     <= AVG({7,8,9})
   1   | 2018-03-12 |  1  |  11 | 11.25 <= AVG({7,8,9,10})
   1   | 2018-03-19 |  1  |  12 | 9.5   <= AVG({8,9,10,11})

Note : There is no value for 2018-02-05

I would normally approach it in one of two ways...


  1. LEFT JOIN on to a template to "force" all the dates to exist, and rely on AVG() effectively "ignoring" NULLs.

This is less than ideal as the number of "things" is huge and constructing this template is expensive.

SELECT
  *,
  AVG(mytable.val) OVER (PARTITION BY things.id, dates.dow
                             ORDER BY dates.date
                          ROW BETWEEN 4 PRECEDING AND 1 PRECEDING
                        )
                           AS preceding_4_week_average
FROM
  things
CROSS JOIN
  dates
LEFT JOIN
  myTable
    ON  myTable.date = dates.date
    AND myTable.id   = things.id


  1. Don't use window functions, use a self join instead

This is less than ideal as there are hundreds of columns in myTable and BigQuery isn't performing very well on that.

SELECT
  myTable.*,
  AVG(hist.val)   AS preceding_4_week_average
FROM
  myTable
LEFT JOIN
  myTable   AS hist
    ON  hist.id    = myTable.id
    AND hist.date >= myTable.date - INTERVAL 28 DAYS
    AND hist.date <  myTable.date
GROUP BY
  myTable.column1,
  myTable.column2,
  etc, etc


The actual question

Does anyone else have an alternative, preferably using windowed/analytic functions to "look back 4 weeks" rather than "look back 4 rows"?


Solution

  • Below is for BigQuery Standard SQL

    As you will see - the trick is in using RANGE instead of ROW

    #standardSQL
    SELECT *,
      AVG(val) OVER(
        PARTITION BY id, dow 
        ORDER BY DATE_DIFF(DATE_TRUNC(date, WEEK), DATE_TRUNC(CURRENT_DATE(), WEEK), WEEK) 
        RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING
      ) AVG
    FROM `project.dataset.table`   
    

    You can test, play with above using dummy data from your question as below

    #standardSQL
    WITH `project.dataset.table` AS (
      SELECT 1 id, DATE '2018-01-01' date, 1 dow, 1 val UNION ALL
      SELECT 1, '2018-01-08', 1, 2  UNION ALL
      SELECT 1, '2018-01-15', 1, 3  UNION ALL
      SELECT 1, '2018-01-22', 1, 4  UNION ALL
      SELECT 1, '2018-01-29', 1, 5  UNION ALL
      SELECT 1, '2018-02-12', 1, 7  UNION ALL
      SELECT 1, '2018-02-19', 1, 8  UNION ALL
      SELECT 1, '2018-02-26', 1, 9  UNION ALL
      SELECT 1, '2018-03-05', 1, 10 UNION ALL
      SELECT 1, '2018-03-12', 1, 11 UNION ALL
      SELECT 1, '2018-03-19', 1, 12  
    )
    SELECT *,
      AVG(val) OVER(
        PARTITION BY id, dow 
        ORDER BY DATE_DIFF(DATE_TRUNC(date, WEEK), DATE_TRUNC(CURRENT_DATE(), WEEK), WEEK) 
        RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING
      ) avg
    FROM `project.dataset.table`
    -- ORDER BY date
    

    with result as

    Row id      date    dow val avg 
    1   1   2018-01-01  1   1   null     
    2   1   2018-01-08  1   2   1.0  
    3   1   2018-01-15  1   3   1.5  
    4   1   2018-01-22  1   4   2.0  
    5   1   2018-01-29  1   5   2.5  
    6   1   2018-02-12  1   7   4.0  
    7   1   2018-02-19  1   8   5.333333333333333    
    8   1   2018-02-26  1   9   6.666666666666667    
    9   1   2018-03-05  1   10  8.0  
    10  1   2018-03-12  1   11  8.5  
    11  1   2018-03-19  1   12  9.5