Search code examples
mysqljoingoogle-bigquerysemi-join

Best JOIN performance on Google BigQuery


I have two large tables, a base table and a table I'm using to filter. I want the keys where 'my_filter' is true.

Which one gives me the best performance Google BigQuery?

Standard Join

select
[fields]
from base_table
join filter_table
on filter_table.my_key = base_table.my_key
where filter_table.my_filter

Join on a subquery

select
[fields]
from base_table
join (
  select my_key from filter_table where my_filter
) filter_table
on filter_table.my_key = base_table.my_key

Semi-join

select
[fields]
from base_table
where my_key in (
  select my_key from filter_table where my_filter
)

For context: the Google BQ docs discuss semi-joins but don't explain why you would use them.


Solution

  • I got curious with your question and decided to run a test in our BQ to see how results would change.

    I've simulated a query, hopefully it does a similar job to what you proposed. It processed 169 GBs of data with no caching enabled. Here's the results I found:

    Standard Join: 10.2s

    enter image description here

    Join on Subquery: 10.6s

    enter image description here

    Semi Join: 10.2s

    enter image description here

    Conclusion

    As Elliott said, try using Standard SQL. Other than that, BigQuery will, for the most part, regardless of your query, process everything in a few seconds.