Search code examples
sqlarraysgoogle-bigqueryaggregate-functionswindow-functions

How to get the most recent n records per group and aggregate them into an array


I know that this is a common problem, and I've seen here and here but neither of these solutions are working for me.

I have a table with fields session_time, user and email (these fields are just dummy placeholders, my real schema is different so please don't look too much into their names) supposing each user can have various emails related to it, I need to output, per row:

1 - a user

2 - an array consisting of every email related to the user

However, some users have WAY too many emails, so I need to limit them to the most recent entries, lets say 2 most recent per user.

WITH
  sample_data AS (
  SELECT
    '2023-06-12 10:00:00' AS session_time,
    '1' AS user,
    'example@example.com' AS email
  UNION ALL
  SELECT
    '2023-06-12 11:00:00' AS session_time,
    '2' AS user,
    'example@example.com' AS email
  UNION ALL
  SELECT
    '2023-06-12 12:00:00' AS session_time,
    '3' AS user,
        'example@example.com' AS email,
  UNION ALL
  SELECT
  '2023-06-12 13:00:00' AS session_time,
  '3' AS user,
  'example2@example.com' AS email
  UNION ALL
  SELECT
  '2023-06-12 14:00:00' AS session_time,
  '3' AS user,
  'example3@example.com' AS email
)

, filtered_user_email AS (
SELECT
  user,
  email,
  ROW_NUMBER() OVER(PARTITION BY user, email ORDER BY MAX(session_time) DESC) row_num
FROM
  sample_data
GROUP BY
  uuid,
  user
)

running SELECT user, array_agg(distinct email) as emails FROM filtered_user_email WHERE row_num <= 2 group by user outputs every row, because every row gets a row_num of 1, so for user 3, I get 3 entries instead of just the two most recent ones.

How can I then just select the 2 most recent emails per user?


Solution

  • Presumably, you just need to change the partition of your row_number:

    select usr, array_agg(email) emails
    from (
        select usr, email, 
            row_number() over(partition by usr order by max(session_time))rn
        from sample_data s
        group by usr, email
    ) t
    where rn <= 2
    group by usr