I have a query like this:
WITH data AS (
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE wiki='en'
AND title LIKE 'Goo%'
), min_date_calc AS (
SELECT DATE(MIN(datehour)) FROM data WHERE datehour<'2018-01-05'
)
SELECT title, SUM(views) views
FROM data
WHERE
DATE(datehour) BETWEEN (SELECT * FROM min_date_calc) AND DATE_ADD((SELECT * FROM min_date_calc), INTERVAL 7 DAY)
GROUP BY 1 ORDER BY 2 DESC LIMIT 1
This query either doesn't run, or doesn't use the partitions to prune the data queried - scanning the whole table instead. What can I do here?
(question based on a comment on reddit)
This query will only scan 2.86GB:
WITH data AS (
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE wiki='en'
AND title LIKE 'Goo%'
)
SELECT title, SUM(views) views
FROM data
WHERE
DATE(datehour) BETWEEN '2018-01-01' AND '2018-01-08' # 2.68 GB
GROUP BY 1 ORDER BY 2 DESC LIMIT 1
But that's because I had the dates as a constant. Similarly, a query like this will perform well too:
WITH data AS (
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE wiki='en'
AND title LIKE 'Goo%'
), min_date AS (
SELECT DATE('2018-01-01')
)
SELECT title, SUM(views) views
FROM data
WHERE
DATE(datehour) BETWEEN (SELECT * FROM min_date) AND DATE_ADD((SELECT * FROM min_date), INTERVAL 7 DAY)
GROUP BY 1 ORDER BY 2 DESC LIMIT 1
This works well, as the date has been made a constant (even when inside a function).
To get these same benefits with a value which is not a constant, then we can use scripting to get all of the benefits:
DECLARE min_date DATE;
SET min_date = (
WITH data AS (
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE wiki='en'
AND title LIKE 'Goo%'
)
SELECT DATE(MIN(datehour)) FROM data WHERE datehour<'2018-01-04'
);
WITH data AS (
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE wiki='en'
AND title LIKE 'Goo%'
)
SELECT title, SUM(views) views
FROM data
WHERE
DATE(datehour) BETWEEN min_date AND DATE_ADD(min_date, INTERVAL 7 DAY) # 2.68 GB
GROUP BY 1 ORDER BY 2 DESC LIMIT 1
Now we see two queries:
min_date
.min_date
comes out of a dynamic query.