Search code examples
sqlgoogle-bigquerysubqueryaggregate-functionsanalytic-functions

Bigquery SQL : When to use Aggregated Function over Analytics Function over Subquery?


Let's say i have a table with Users :

user_id
abc
def

And a table of Purchases :

purchase_id purchase_date status user_id
1 2020-01-01 sold abc
2 2020-02-01 refunded abc
3 2020-03-01 sold def
4 2020-04-01 sold def
5 2020-05-01 sold def

I want the status of the last purchase for each user, which would be :

user_id last_purchase_date status
abc 2020-02-01 refunded
def 2020-05-01 sold

I am wondering what is the best approach (in term of performance, cost, readability ...) between those three queries that give the same results :

Aggregated Function

SELECT 
  user_id,
  MAX(purchase_date) as last_purchase_date,
  ARRAY_AGG(status ORDER BY purchase_date DESC LIMIT 1)[SAFE_OFFSET(0)] as last_status
FROM user
LEFT JOIN purchase USING (user_id)
GROUP BY user_id

Analytics Function

SELECT
  DISTINCT
  user_id,
  MAX(purchase_date) OVER (PARTITION BY user_id) as last_purchase_date,
  LAST_VALUE(status) OVER (PARTITION BY user_id ORDER BY purchase_date ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as last_status,
FROM user
LEFT JOIN purchase USING (user_id)

Subquery

SELECT 
  user_id,
  purchase_date as last_purchase_date,
  status as last_status
FROM user
LEFT JOIN purchase USING (user_id)
WHERE purchase_date IN (
  SELECT 
    MAX(purchase_date) as purchase_date
  FROM purchase
  GROUP BY user_id
)

Here is the Dataset for those who want it :

WITH purchase as (
  SELECT 1 as purchase_id, "2020-01-01" as purchase_date, "sold" as status, "abc" as user_id
  UNION ALL SELECT 2 as purchase_id, "2020-02-01" as purchase_date, "refunded" as status, "abc" as user_id
  UNION ALL SELECT 3 as purchase_id, "2020-03-01" as purchase_date, "sold" as status, "def" as user_id
  UNION ALL SELECT 4 as purchase_id, "2020-04-01" as purchase_date, "sold" as status, "def" as user_id
  UNION ALL SELECT 5 as purchase_id, "2020-05-01" as purchase_date, "sold" as status, "def" as user_id
), user as (
    SELECT "abc" as user_id,
    UNION ALL SELECT "def" as user_id,
)

Solution

  • The answer is pretty simple: always use the most performing and cheapest approach.

    When you run your code snippets with the provided test data you can see in the execution details which one is better for this use case:

    Aggregated Function enter image description here

    Analytics Function enter image description here

    As you can see the aggregated function consumed around 3 times less slot time and shuffled less bytes than the analytics function. That means for this use case it is much better to use your aggregate function then the analytical one.

    The power of analytical functions comes when you have use cases where you cannot use aggregate functions and can only be solved with analytical functions. Analytical functions have a wider range of application than other types of functions.