Search code examples
google-bigquerysubquery

Using subquery for _PARTITIONTIME in bigquery does not limit cost


When I run the below query on BQ using standard SQL, it says it will process 76.6TB when running

SELECT 
event_time, user_id, activity_id,dbm_insertion_order_id, dbm_total_media_cost_usd 
FROM `raw.5295.activity_*`
WHERE _PARTITIONTIME >(SELECT * FROM `analytics-dwh.autobidding.activity_list` )
AND timestamp_micros(event_time)  > (SELECT timestamp_micros(MAX(event_time)) from `essence-analytics-dwh.ml_for_autobidding.nest_na_4q18_activity_updated_daily`)
AND _TABLE_SUFFIX IN ('25','20')

The table analytics-dwh.autobidding.activity_list has just one column with a unique list of integers

If I remove the subquery from the above table, the query uses less than 500GB when run

SELECT 
event_time, user_id, activity_id,dbm_insertion_order_id, dbm_total_media_cost_usd 
FROM `raw.5295.activity_*`
WHERE _PARTITIONTIME >TIMESTAMP('2018-12-20')
AND timestamp_micros(event_time)  > (SELECT timestamp_micros(MAX(event_time)) from `essence-analytics-dwh.ml_for_autobidding.nest_na_4q18_activity_updated_daily`)
AND _TABLE_SUFFIX IN ('25','20')

Why does this happen when I use a subquery? Is there a workaround?


Solution

  • Why does this happen with use of subquery?

    In general, partition pruning will reduce query cost when the filters can be evaluated at the outset of the query without requiring any subquery evaluations or data scans.

    You can see more about Limiting partitions queried using pseudo columns

    So, in your first query (where you use subquery) - the pruning is not happening (It does not limit use of partitions based on the condition that involves the subquery)

    In second query you use _PARTITIONTIME >TIMESTAMP('2018-12-20') so partitions are limited

    Bottom line: filters on _PARTITIONTIME that include subqueries can't be used to limit the number of partitions scanned for a partitioned table.

    Is there a workaround?

    You should split your task to two steps: calculate filter for _PARTITIONTIME using whatever logic you need off of table analytics-dwh.autobidding.activity_list and then use it (instead of subquery) - using any client of your choice