Tags: google-bigquery, out-of-memory, sql-order-by, query-optimization

BigQuery resources exceeded during query execution


I have a daily scheduled job on BigQuery, but it crashed with a "resources exceeded" (out-of-memory) error. The job pulls the most recently updated record from each of five tables, which means I use ROW_NUMBER() OVER(... ORDER BY ...) five times, and that consumes a lot of memory. Is there an efficient way to fix the error by refactoring the query?

Here's the brief code structure:

CREATE TEMP TABLE main_info AS

WITH orders_1 AS (
    SELECT
        * EXCEPT(rnk)
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
        FROM order_1
    )
    WHERE rnk = 1
),

orders_2 AS (
    SELECT
        * EXCEPT(rnk)
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
        FROM order_2
    )
    WHERE rnk = 1
),

orders_3 AS (
    SELECT
        * EXCEPT(rnk)
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
        FROM order_3
    )
    WHERE rnk = 1
)

SELECT
    *
FROM orders_1 o1
LEFT JOIN orders_2 o2
    ON o1.order_id = o2.order_id
LEFT JOIN orders_3 o3
    ON o1.order_id = o3.order_id


I was expecting to bring memory usage back under the limit. From some research, two options came up: replace ROW_NUMBER() OVER(... ORDER BY ...) with ARRAY_AGG() to optimize performance, or create a temp table for each source table first and then combine them. Is there any better advice?
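For reference, the ARRAY_AGG() alternative mentioned above swaps the window function for a grouped aggregation with LIMIT 1, which BigQuery can often execute with less memory because it never has to materialize and fully sort every partition. A sketch against the order_1 table from the question (assuming order_id and update_time as above):

```sql
-- Deduplicate order_1 without a window function: collect each order_id's
-- rows into an array ordered by update_time (newest first), keep only the
-- first element, then expand that struct back into columns.
SELECT
    latest.*
FROM (
    SELECT
        ARRAY_AGG(t ORDER BY update_time DESC LIMIT 1)[OFFSET(0)] AS latest
    FROM order_1 t
    GROUP BY order_id
)
```

The LIMIT 1 inside ARRAY_AGG is the key detail: it lets BigQuery discard all but the newest row per group as it aggregates, rather than ranking every row and filtering afterwards.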


Solution

  • I'm not sure whether this will solve your problem, but you could definitely use QUALIFY to simplify your CTEs. For example:

     SELECT *
       FROM order_1
    QUALIFY ROW_NUMBER() OVER(order_window) = 1
     WINDOW order_window AS (
         PARTITION BY order_id
          ORDER BY update_time DESC
     )
    

    (also uses WINDOW for readability)

    It's possible that this will help by eliminating the subqueries, but that depends on whether BigQuery already optimises them to the same plan behind the scenes.

    Other ideas:

    • do the LEFT JOINs give you very different results from INNER JOINs? Either way, because the joins are keyed on o1.order_id, rows in orders_2 and orders_3 whose order IDs never appear in orders_1 are dropped anyway, so you could pre-empt this by pre-filtering your second and third CTEs to exclude those order IDs.
    • does it have to be a temporary table you create? Or could you create a full table for each of the CTEs instead and build the result in stages?
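To illustrate the pre-filtering idea, here is a sketch for the second CTE (the same change applies to the third). This is safe because, with the join keyed on o1.order_id, any order_2 row whose order_id is absent from order_1 can never reach the final result:

```sql
-- Pre-filter order_2 before deduplicating, so the window function
-- only has to rank the rows that can actually survive the join.
WITH orders_2 AS (
    SELECT
        * EXCEPT(rnk)
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
        FROM order_2
        -- drop rows the join keyed on o1.order_id would discard anyway
        WHERE order_id IN (SELECT order_id FROM order_1)
    )
    WHERE rnk = 1
)
SELECT * FROM orders_2
```

The smaller the overlap between order_2 and order_1, the more this shrinks the window function's input, and with it the memory the query needs.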