Search code examples
performancegoogle-bigquerypartitioning

Why is there such a difference in performance between these two queries?


Can someone explain to me why the first query performs well (only processes 254KB) while the second two queries process 750MB? The only difference between these queries is that the first one contains a hardcoded list of IDs, the second query gets the exact same list of IDs from a table (no hardcoding), and the third query gets its list of IDs from a hardcoded array.

The src_id field is partitioned the same in all tables.

The demo_table contains 2.7 million records distributed evenly across 10,000 partitions.

The id_table contains only 1 field and 3 records.

-- this query performs well - only processes 254KB of data.
select *
from `myproject.mydataset.demo_table`
where src_id in ( 1435718211154057483 
                , 1260912821261010025 
                , 3916818671952292004)
;


-- this query performs poorly - processes 750 of data. 
-- the subquery returns only 3 rows, the exact same ids as used above.
select *
from `myproject.mydataset.demo_table`
where src_id in (select src_id from myproject.mydataset.id_table) 
;

-- this query performs poorly - processes 750 of data. 
-- This query example is included only to demonstrate that the performance
-- degradation seems to related to the presence of a subquery.
select *
from `myproject.mydataset.demo_table`
where src_id in (select src_id
                  from unnest((
                          select [ 1435718211154057483 
                                 , 1260912821261010025 
                                 , 3916818671952292004])) as src_id)

It looks like BigQuery is executing the query differently depending on whether the IN clause contains a subquery or a simple list of values.

What I am really trying to accomplish is to get the second query working as well as the first (no hardcoding). Can anyone explain to me how to construct this query for better performance?


Solution

  • Per the documentation:

    To limit the partitions that are scanned in a query, use a constant expression in your filter. If you use dynamic expressions in your query filter, BigQuery must scan all of the partitions.

    Your first query uses a constant expression, ie a list of values. Your other examples do not use a constant expression, they are using subqueries.