Tags: sql, memory, google-bigquery, sql-order-by, window-functions

Google BigQuery - why does a window function order by cause a memory error even when used together with partition by?


I get a memory error in Google BigQuery that I don't understand:

My base table (> 1 billion rows) consists of a userID, a balance increment per day, and the day. From the daily balance_increment I want to return the running total balance each time there is a new increment. For the next step I also need the next day on which a new balance increment occurs. So I do:

select 
    userID
    ,   date
    ,   sum(balance_increment) over (partition by userID order by date) as balance
    ,   lead(date, 1, current_date()) over (partition by userID order by date) as next_date
from my_base_table
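
To illustrate what this is meant to produce, here is a self-contained sketch on a tiny hand-made table (values invented):

with my_base_table as (
  -- tiny stand-in for the real table; values invented
  select * from unnest([
      struct('u1' as userID, date '2021-01-01' as date, 10 as balance_increment),
      struct('u1', date '2021-01-05', -3),
      struct('u2', date '2021-01-02', 7)
  ])
)
select
    userID
    ,   date
    ,   sum(balance_increment) over (partition by userID order by date) as balance      -- running total per user
    ,   lead(date, 1, current_date()) over (partition by userID order by date) as next_date  -- next increment day, today if none
from my_base_table

For 'u1' this yields balances 10 and 7, with next_date 2021-01-05 on the first row and current_date() on each user's last row.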

Although I used partition by in the over clause, I get a memory error with this query, caused by the sort operation (the order by, if I understood correctly?):

BadRequest: 400 Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 135% of limit.
Top memory consumer(s):
  sort operations used for analytic OVER() clauses: 98%
  other/unattributed: 2%

But when I check how often each unique userID appears, the most frequent one occurs fewer than 4,000 times. I do have a lot of userIDs (apparently > 31 million, as the image below suggests), but I thought that with partition by the work would be split across different slots if necessary?

Here I check how often each single userID occurs. This query, by the way, works just fine:

SELECT
  userID
  , count(*) as userID_count
FROM my_base_table
GROUP BY userID
ORDER BY userID_count DESC

(sorry, in the image I called it entity instead of userID)

[Image: result of the count query, showing > 31 million distinct userIDs, the highest userID_count being below 4,000]
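
For completeness, the number of distinct userIDs and the size of the largest partition could also be read in one pass (a sketch against the same table):

select
    count(*) as distinct_userIDs                   -- number of partitions
    ,   max(userID_count) as max_rows_per_userID   -- size of the largest partition
from (
    select userID, count(*) as userID_count
    from my_base_table
    group by userID
)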

So my questions are:

  1. Did I understand it correctly that the memory error comes from the order by date?
  2. Why is that a big issue when each partition has fewer than 4,000 occurrences to be ordered when I use partition by?
  3. Why does my second query complete fine although at the end it has to order > 31 million rows?
  4. How can I solve this issue?

Solution

  • I solved the memory issue by pre-ordering the base table by userID and date, as suggested by @Samuel, who pointed out that pre-ordering should reduce the key exchange across the nodes. It worked!

    with ordered_base_table as (
        select * from my_base_table order by userID, date
    )
    
    select 
        userID
        ,   date
        ,   sum(balance_increment) over (partition by userID order by date) as balance
        ,   lead(date, 1, current_date()) over (partition by userID order by date) as next_date
    from ordered_base_table
    

    Thanks!
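
  • As an untested alternative (a sketch, and only applicable if my_base_table can contain several increments per userID and date): pre-aggregating to one row per userID and date before the window functions shrinks the input that each OVER() sort has to handle.

    with daily as (
        -- collapse multiple increments per user and day into a single row
        select
            userID
            ,   date
            ,   sum(balance_increment) as balance_increment
        from my_base_table
        group by userID, date
    )
    select
        userID
        ,   date
        ,   sum(balance_increment) over (partition by userID order by date) as balance
        ,   lead(date, 1, current_date()) over (partition by userID order by date) as next_date
    from daily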