Search code examples
sqlpostgresqlinner-joinwindow-functions

Which one is quicker/optimized - Inner Join or Partition By - to obtain Aggregated data?


In my data 'table1', there are multiple records for each app_id. I'm trying to get a latest app_id entry. Which one of the below queries would be quick or better to consider in terms of performance and memory usage? Currently, I work on AWS server and use PostgreSQL.

Query 1:

SELECT b.* 
FROM 
    (SELECT app_id, max(datetime) as datetime
    from table1
    group by 1) a
INNER JOIN 
    (SELECT * 
    from table1) b
ON a.app_id = b.app_id AND a.datetime = b.datetime

Query 2:

SELECT * 
FROM 
    (SELECT *, row_number() over (partition by app_id order by datetime DESC 
    NULLS LAST) as num FROM table1) sub
WHERE sub.num=1

Solution

  • The right answer to "which is faster" is to try the queries on your data and your systems.

    That said, there are some considerations in favor of row_number(). In particular, window functions are not an "accidental" feature in databases. Adding a new string function is just a function and the function may or may not be optimized.

    On the other hand, window functions required rewriting/redesigning some fundamental components of the database engine. In general, this was done with performance in mind. So, I usually find that window functions are faster than equivalent constructs.

    The only exception that I regularly find (across databases) ironically applies in your case. And, it is not using the join and group by. Instead it is:

    select t1.*
    from table1 t1
    where t1.datetime = (select max(tt1.datetime)
                         from table1 tt1
                         where tt1.app_id = t1.app_id
                        );
    

    Along with an index on table1(app_id, datetime).

    The basic reason for the performance improvement is that this scans table1 once and does an index lookup at each row.

    The join/group by is scanning the table multiple times and the aggregation is expensive. The row_number() version scans the table (or index), calculates the value, and then brings the value back to every row -- similar to two scans of the data.