Tags: sql, google-bigquery, sum, sql-order-by, window-functions

SQL window function ORDER BY 2 columns, sum over all previous rows in Google BigQuery


In Google BigQuery, I want to compute SUM(value) over all rows where condition1 is smaller than in the current row AND condition2 is smaller than in the current row. Note that the dataset is ordered by condition1, so every row with a smaller condition1 comes before the current row. Also, the dataset is very large, so a self-join is not feasible in terms of computation time.

Here is an example:

-- | condition1 | condition2 | value | expected_output |
-- |------------|------------|-------|-----------------|
-- |     1      |      2     |    1  |        0        |
-- |     1      |      2     |    1  |        0        |
-- |     2      |      1     |    1  |        0        |
-- |     2      |      2     |    1  |        0        | 
-- |     2      |      3     |    1  |        2        | --> sum over rows 1 and 2
-- |     3      |      2     |    1  |        1        | --> sum over row 3
-- |     3      |      2     |    1  |        1        | --> sum over row 3
-- |     4      |      5     |    1  |        7        | --> sum over rows 1-7
-- |     5      |      4     |    1  |        7        | --> sum over rows 1-7
-- |     6      |      7     |    1  |        9        | --> sum over rows 1-9

Note that in this example value is always 1 to keep the example simple, but value can be any number.
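
To pin down the expected_output column, the rule can also be written as a brute-force self-join. This is only a reference for small inputs such as the example table (a self-join is ruled out for the real dataset). It is a minimal sketch that assumes the ten example rows as an inline table; the names numbered, pos, cur and prev are illustrative and not part of the original data.

```sql
WITH data AS (
  SELECT *
  FROM UNNEST([
    STRUCT(1 AS condition1, 2 AS condition2, 1 AS value),
    (1, 2, 1), (2, 1, 1), (2, 2, 1), (2, 3, 1),
    (3, 2, 1), (3, 2, 1), (4, 5, 1), (5, 4, 1), (6, 7, 1)
  ])
), numbered AS (
  -- pos keeps duplicate rows apart when grouping below
  SELECT *, ROW_NUMBER() OVER (ORDER BY condition1, condition2) AS pos
  FROM data
)
SELECT
  cur.pos AS pos,
  cur.condition1,
  cur.condition2,
  cur.value,
  -- rows with no strictly smaller partner get 0 instead of NULL
  IFNULL(SUM(prev.value), 0) AS expected_output
FROM numbered AS cur
LEFT JOIN numbered AS prev
  ON prev.condition1 < cur.condition1   -- strictly smaller in condition1 ...
 AND prev.condition2 < cur.condition2   -- ... and strictly smaller in condition2
GROUP BY cur.pos, cur.condition1, cur.condition2, cur.value
ORDER BY pos
```

Both comparisons are strict, so a row never counts itself or its duplicates.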

I tried with:

SUM(VALUE) OVER (ORDER BY condition1, condition2 RANGE UNBOUNDED PRECEDING)

The problem is that only the first condition, condition1, is "checked". The window orders by condition2 as well, but all preceding rows are still included in the sum, even if their condition2 value is larger than the value in the current row.
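
The effect can be reproduced with two rows. The sketch below uses an inline table made up for illustration, with running_sum as an illustrative column name; the window expression is the one from the attempt above.

```sql
-- Two rows only: (2, 3) sorts before (3, 2) because condition1 is compared first.
WITH data AS (
  SELECT *
  FROM UNNEST([
    STRUCT(2 AS condition1, 3 AS condition2, 1 AS value),
    (3, 2, 1)
  ])
)
SELECT
  *,
  -- frame = every row that sorts at or before the current row, current row included
  SUM(value) OVER (ORDER BY condition1, condition2 RANGE UNBOUNDED PRECEDING) AS running_sum
FROM data
ORDER BY condition1, condition2
-- running_sum is 1 for (2, 3) and 2 for (3, 2);
-- the sum described in the question would be 0 for both rows
```

With two ORDER BY keys the rows are sorted lexicographically, so condition2 only breaks ties within equal condition1 and never filters rows out of the frame. The frame also includes the current row itself, which the desired sum excludes.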

Here is a query to work with, in which I already tested the behavior of different windows:

WITH data AS (
  SELECT * 
  FROM UNNEST([
    STRUCT
    (1 as condition1, 2 as condition2,1 as value), 
    (2,1,1),
    (3,2,1),
    (2,2,1),
    (5,4,1),
    (4,5,1),
    (2,3,1),
    (6,7,1),
    (3,2,1)
  ])
)
SELECT
  *
  ,SUM(VALUE) OVER (ORDER BY condition1, condition2 RANGE UNBOUNDED PRECEDING) as test_1
  ,SUM(VALUE) OVER (ORDER BY condition2, condition1 RANGE UNBOUNDED PRECEDING) as test_2
  ,RANK() OVER (ORDER BY condition1, condition2) as rank1
  ,RANK() OVER (ORDER BY condition2, condition1) as rank2
FROM data
ORDER BY condition1

Thanks!

Update: Thanks to @Mikhail Berlyant's input, I noticed that my question was not phrased correctly. I have now updated the question and the example.


Solution

  • Try the approach below. I know it looks a little crazy, but just try it. Here I partition the calculation by condition2 and then join back to the original data:

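    WITH temp1 AS (
      -- give every row a unique position so duplicate rows stay distinct
      SELECT *, ROW_NUMBER() OVER(ORDER BY condition1, condition2) AS pos
      FROM your_table
    ), temp2 AS (
      -- one array of rows per distinct condition2 value
      SELECT condition2, ARRAY_AGG(STRUCT(pos, condition1, value) ORDER BY condition1) r
      FROM temp1
      GROUP BY condition2
    ), temp3 AS ( 
      -- pair each condition2 group with every group that has a smaller condition2,
      -- then sum the values of the elements with a smaller condition1
      SELECT a.condition1, t1.condition2, SUM(b.value) result, a.pos 
      FROM temp2 t1 LEFT JOIN temp2 t2 ON t1.condition2 > t2.condition2,
      t1.r a JOIN t2.r b ON a.condition1 > b.condition1
      GROUP BY a.pos, condition1, condition2
    )
    -- join back to the original rows; rows without a match get a NULL result
    SELECT * EXCEPT(pos) 
    FROM temp1 LEFT JOIN temp3 USING(pos, condition1, condition2)    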
    

    If applied to the sample data in your updated question, the output is:

    [image: query output]

    Second option to try

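    WITH temp1 AS (
      -- pre-aggregate duplicate (condition1, condition2) pairs
      SELECT condition1, condition2, SUM(value) value
      FROM your_table
      GROUP BY condition1, condition2
    ), temp2 AS (
      -- one array of (condition1, value) per distinct condition2 value
      SELECT condition2, ARRAY_AGG(STRUCT(condition1, value) ORDER BY condition1) r
      FROM temp1
      GROUP BY condition2
    ), temp3 AS ( 
      -- sum values from groups with a smaller condition2 and elements with a smaller condition1
      SELECT a.condition1, t1.condition2, SUM(b.value) result
      FROM temp2 t1 LEFT JOIN temp2 t2 ON t1.condition2 > t2.condition2,
      t1.r a JOIN t2.r b ON a.condition1 > b.condition1
      GROUP BY condition1, condition2
    )
    -- every original row picks up the result of its (condition1, condition2) pair
    SELECT * 
    FROM your_table LEFT JOIN temp3 USING(condition1, condition2)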
    

    With the same output, of course.

    While the second version has a better chance of surviving, it has a small limitation compared with the first version in this answer: the first version is very easy to modify for the condition mentioned in your question, where condition2 is compared with <= rather than just < (by the way, I don't see this in your question now, so I might be mixing something up here, sorry).