Search code examples
sqlamazon-redshiftquery-optimizationgreatest-n-per-group

Effectively select latest row for each group in a very large table?


I have (for example's sake) a table Users (user_id, status, timestamp, ...).
I also have another table SpecialUsers (user_id, ...).

I need to show each special user's latest status.

The problem is that the Users table is VERY, VERY LARGE (more than 50 Billion rows). Most of the solutions in for instance this question just hang or get "disk full" error.

SpecialUsers table is much smaller - "only" 600K rows.

SELECT DISTINCT ON() is not supported. Working on Amazon RedShift.

EDIT: per request to see the failed attempts - one of those resulting in the disk full error is like this:

with users_with_status (user_id, status, timestamp)
as (
        select su.user_id, u.instance_type, u.timestamp
        from specialusers su
        join users u on su.user_id = u.user_id
)
select l.instance_id, l.instance_type
from users_with_status l
left outer join users_with_status r
     on l.user_id = r.user_id and l.timestamp < r.timestamp
where r.timestamp is null;

I know that I'm joining a bug table with itself but was hoping that the first join with small table would reduce the number of processed rows.

Anyway, seems that window functions is the solution here.


Solution

  • Perhaps a join with a window function will work:

    select su.*
    from (select s.user_id, u.status, u.timestamp,
                 max(u.timestamp) over (partition by s.user_id) as max_timestamp
          from specialusers s join
               users u
               on s.user_id = u.user_id
         ) su
    where timestamp = max_timestamp;
    

    This specifically uses max() instead of row_number() on the speculation that it might use slightly fewer resources.