Search code examples
google-bigqueryquery-optimization

BigQuery comparative performance with naive ordering by table size


The general advice for optimising BigQuery joins from the docs is as follows

As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size.

I'm still learning BigQuery so am exploring the public datasets, specifically the bigquery-public-data.uspto_oce_litigation dataset. The code below only requires the public dataset.

Naively following the advice results in a query that doesn't finish within 2mins of elapsed time

WITH
  docs_and_cases AS (
  SELECT
    cases.case_name,
    cases.court_name,
    docs.case_row_id,
    docs.short_description
  FROM
    `bigquery-public-data.uspto_oce_litigation.documents` AS docs
  JOIN
    `bigquery-public-data.uspto_oce_litigation.cases` AS cases
  ON
    docs.case_row_id = docs.case_row_id
  WHERE
    docs.short_description IS NOT NULL
    AND cases.case_name IS NOT NULL
    AND docs.upload_date IS NOT NULL )
SELECT
  docs_and_cases.case_row_id,
  attorneys.name,
  attorneys.contactinfo,
  docs_and_cases.court_name,
  docs_and_cases.short_description,
FROM
  docs_and_cases
JOIN
  `bigquery-public-data.uspto_oce_litigation.attorneys` AS attorneys
ON
  docs_and_cases.case_row_id = attorneys.case_row_id

Before I found this advice I constructed the following query, which completes in less than 20secs of elapsed time but is running almost completely counter to the advice on table order.

WITH
  cases_and_attorneys AS (
  SELECT
    cases.case_row_id,
    cases.case_name,
    cases.court_name,
    attorneys.name,
    attorneys.contactinfo
  FROM
    `bigquery-public-data.uspto_oce_litigation.cases` AS cases
  JOIN
    `bigquery-public-data.uspto_oce_litigation.attorneys` AS attorneys
  ON
    cases.case_row_id = attorneys.case_row_id
  WHERE
    cases.case_name IS NOT NULL )
SELECT
  cases_and_attorneys.case_row_id,
  cases_and_attorneys.name,
  cases_and_attorneys.contactinfo,
  court_name,
  short_description
FROM
  `bigquery-public-data.uspto_oce_litigation.documents` AS docs
JOIN
  cases_and_attorneys
ON
  docs.case_row_id = cases_and_attorneys.case_row_id
WHERE
  short_description IS NOT NULL
  AND upload_date IS NOT NULL

Are there lessons I can learn from the structure of these tables to avoid naively joining by table size? Thanks!


Solution

  • The reason your first query takes so long is because of this line:

    docs.case_row_id = docs.case_row_id
    

    The "join larger tables to smaller tables" advice is good - but in my experience I would not try to optimize too early. All of these tables are "small" in the grand scheme of things and changing their order in the join doesn't have much effect. When you see such a dramatic difference between execution times (>10x) I would double check that you aren't doing something like that self join.