The general advice for optimising BigQuery joins from the docs is as follows
As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size.
I'm still learning BigQuery so am exploring the public datasets, specifically the bigquery-public-data.uspto_oce_litigation
dataset. The code below only requires the public dataset.
Naively following the advice results in a query that doesn't finish within 2mins of elapsed time
WITH
docs_and_cases AS (
SELECT
cases.case_name,
cases.court_name,
docs.case_row_id,
docs.short_description
FROM
`bigquery-public-data.uspto_oce_litigation.documents` AS docs
JOIN
`bigquery-public-data.uspto_oce_litigation.cases` AS cases
ON
docs.case_row_id = docs.case_row_id
WHERE
docs.short_description IS NOT NULL
AND cases.case_name IS NOT NULL
AND docs.upload_date IS NOT NULL )
SELECT
docs_and_cases.case_row_id,
attorneys.name,
attorneys.contactinfo,
docs_and_cases.court_name,
docs_and_cases.short_description,
FROM
docs_and_cases
JOIN
`bigquery-public-data.uspto_oce_litigation.attorneys` AS attorneys
ON
docs_and_cases.case_row_id = attorneys.case_row_id
Before I found this advice I constructed the following query, which completes in less than 20secs of elapsed time but is running almost completely counter to the advice on table order.
WITH
cases_and_attorneys AS (
SELECT
cases.case_row_id,
cases.case_name,
cases.court_name,
attorneys.name,
attorneys.contactinfo
FROM
`bigquery-public-data.uspto_oce_litigation.cases` AS cases
JOIN
`bigquery-public-data.uspto_oce_litigation.attorneys` AS attorneys
ON
cases.case_row_id = attorneys.case_row_id
WHERE
cases.case_name IS NOT NULL )
SELECT
cases_and_attorneys.case_row_id,
cases_and_attorneys.name,
cases_and_attorneys.contactinfo,
court_name,
short_description
FROM
`bigquery-public-data.uspto_oce_litigation.documents` AS docs
JOIN
cases_and_attorneys
ON
docs.case_row_id = cases_and_attorneys.case_row_id
WHERE
short_description IS NOT NULL
AND upload_date IS NOT NULL
Are there lessons I can learn from the structure of these tables to avoid naively joining by table size? Thanks!
The reason your first query takes so long is because of this line:
docs.case_row_id = docs.case_row_id
The "join larger tables to smaller tables" advice is good - but in my experience I would not try to optimize too early. All of these tables are "small" in the grand scheme of things and changing their order in the join doesn't have much effect. When you see such a dramatic difference between execution times (>10x) I would double check that you aren't doing something like that self join.