I have the following source data in SQL in GBQ:
SELECT 1 AS client_id, CAST('2022-10-13' AS DATE) AS session_date, 'denied' AS value,
UNION ALL
SELECT 1, CAST('2022-10-15' AS DATE), 'granted'
UNION ALL
SELECT 1, CAST('2022-10-18' AS DATE), 'denied'
UNION ALL
SELECT 2, CAST('2022-01-01' AS DATE), 'denied'
UNION ALL
SELECT 2, CAST('2022-01-05' AS DATE), 'granted'
UNION ALL
SELECT 3, CAST('2022-01-01' AS DATE), 'granted'
UNION ALL
SELECT 4, CAST('2022-01-03' AS DATE), 'granted'
And I want to get a table like this where I have the last activity date for each user with an struct that saves every activity session date and it's value (denied/granted):
I have used the following query to achieve this:
WITH max_date AS (
SELECT
client_id,
session_date,
value,
MAX(session_date) OVER (PARTITION BY client_id) AS last_activity
FROM
source_data )
SELECT
client_id,
last_activity,
ARRAY_AGG(STRUCT(session_date,
value)
ORDER BY
session_date) AS push_permission
FROM
max_date
GROUP BY
1,
2
The issue is that the query size is going from 11GB to 1.2TB. I don't understand why this is happening, as I have used the max() partition by over other times and this wasn't happening. Is there any way to fix this? It seems it's missing the partition for some reason.
Thanks!
I don't think that you need window functions here, aggregation is sufficient; the last activity date can be computed with a simple aggregate max()
.
Is the query less resource-consuming if we phrase it like this?
select client_id, max(session_date) as last_activity
array_agg(struct(session_date, value) order by session_date) as push_permission
from source_data
group by client_id
Note that the query still needs to scan and aggregate the whole table - given your requirement, this cannot be avoided.