Search code examples
google-bigquerydatabase-partitioning

How can I get the benefits from a BigQuery partitioned table when the values to filter that column come out of a subquery?


I have a query like this:

WITH data AS (
  SELECT *
  FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
  WHERE wiki='en'
  AND title LIKE 'Goo%'
), min_date_calc AS (
  SELECT DATE(MIN(datehour)) FROM data WHERE datehour<'2018-01-05'
)


SELECT title, SUM(views) views
FROM data
WHERE 
  DATE(datehour) BETWEEN (SELECT * FROM min_date_calc) AND DATE_ADD((SELECT * FROM min_date_calc), INTERVAL 7 DAY) 
GROUP BY 1 ORDER BY 2 DESC LIMIT 1

This query either doesn't run, or doesn't use the partitions to prune the data queried - scanning the whole table instead. What can I do here?

(question based on a comment on reddit)


Solution

  • This query will only scan 2.86GB:

    WITH data AS (
      SELECT *
      FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
      WHERE wiki='en'
      AND title LIKE 'Goo%'
    )
    
    SELECT title, SUM(views) views
    FROM data
    WHERE 
      DATE(datehour) BETWEEN '2018-01-01' AND '2018-01-08' # 2.68 GB 
    GROUP BY 1 ORDER BY 2 DESC LIMIT 1
    

    But that's because I had the dates as a constant. Similarly, a query like this will perform well too:

    WITH data AS (
      SELECT *
      FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
      WHERE wiki='en'
      AND title LIKE 'Goo%'
    ), min_date AS (
        SELECT DATE('2018-01-01')
    )
    
    
    SELECT title, SUM(views) views
    FROM data
    WHERE 
      DATE(datehour) BETWEEN (SELECT * FROM min_date) AND DATE_ADD((SELECT * FROM min_date), INTERVAL 7 DAY) 
    GROUP BY 1 ORDER BY 2 DESC LIMIT 1
    

    This works well, as the date has been made a constant (even when inside a function).

    To get these same benefits with a value which is not a constant, then we can use scripting to get all of the benefits:

    DECLARE min_date DATE;
    
    SET min_date = (
      WITH data AS (
        SELECT *
        FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
        WHERE wiki='en'
        AND title LIKE 'Goo%'
      )
      SELECT DATE(MIN(datehour)) FROM data WHERE datehour<'2018-01-04'
    );
    
    WITH data AS (
      SELECT *
      FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
      WHERE wiki='en'
      AND title LIKE 'Goo%'  
    )
    
    SELECT title, SUM(views) views
    FROM data
    WHERE 
      DATE(datehour) BETWEEN min_date AND DATE_ADD(min_date, INTERVAL 7 DAY) # 2.68 GB 
    GROUP BY 1 ORDER BY 2 DESC LIMIT 1
    

    Now we see two queries:

    • The first one goes as over as much data as it needs to set the value of min_date.
    • The second one uses 2.86GB as the previous did too - but now the value of min_date comes out of a dynamic query.