Search code examples
google-bigquery

Window function does not work as expected, why?


I don't understand how is the window frame computed when you don't specify an partition

SELECT   
COUNT(letter) OVER( ORDER BY letter
  ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING
) as letter_count, 
n1, letter
FROM   `my_dataset.my_table` 

So my expectation is that it will look for previous row, current row and next 2 rows and it will count. What is counting? is it going to count the occurance of the current letter? No matter what I'm computing I can't figure out how window function operate when you specify the rows between

This is the output:

letter_count n1 letter
4   2   B
4   4   B
4   1   E
4   4   E
3   3   B
4   5   F
4   2   E
4   4   F
4   0   B
2   7   F
3   2   F

let's take row 1 (NULL, B, B, E )
total count is 3, there are 2B and 1 E ..... if you count all B from the entire table that makes sense - 4

But then you go to row nr 3 and you have (B E E B) you have 2B and 2E .. E in the entire table are just 3, but in current window just 2 . If you count all letters from this window you get 4 and this is fine... but for first one does not make sense because you have 3

What is interesting is that if I modify the query like this:

SELECT   letter,
COUNT(letter) OVER( ORDER BY letter
  ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING
) as letter_count, 
n1
FROM   `my_dataset.my_table` 

the ordered is changed ... altough I'm ordering by letter ... so why other columns are affecting the window function ?

I read the documentation and I couldn't find any information so far about this. It does not make any sense.


Solution

  • The important thing here is that you are ordering by the letter field within your window function.

    With window functions you will do a calculation for each row in the source table. To do this calculation, you construct a new table. It will be ordered in the way you specified.

    It can also be partitioned, which means for each calculation, only certain rows are considered. Since you are not partitioning, the table you construct will include all rows, ordered in the way you specified.

    Here is one possible table, which respects your ORDER BY letter:

    n1 letter
    2   B
    4   B
    0   B
    3   B
    1   E
    4   E
    2   E
    5   F
    4   F
    7   F
    2   F
    

    Note that there are other ways you could order these rows, while respecting your ORDER BY letter command. BigQuery will do whatever is easiest based on how the data are actually (physically) stored. It's not really possible to predict what BigQuery will do! That is why you get different behaviour depending on the exact way you write the query.

    Now, finishing off: your window is now going to go row by row and consider only a subset of this ordered table. It will grab 1 preceding row, the 'target' row, and 2 following rows in order to do the calculation.

    The calculation itself is COUNT(letter). Generally this is going to return 4, except for the first row and the last two rows in the ordered table. Therefore one of the rows for which the letter is B will have 3 instead of 4. And one of the F rows will have 3 and another will have 2.

    This is exactly what you see. Importantly, the output rows are put back in the original input order!


    In summary: your confusion is caused by the nature of what ORDER BY letter does in the window clause. When the calculation is done, everything is put into this order. However, the results retain the original order.

    Because the ORDER BY letter order is underdetermined (several different ways to do it) the results are somewhat random (they are repeatable - BigQuery does whatever is easiest - but small changes to the query such as selecting additional columns will affect the outcome).