Search code examples
sqlamazon-redshift

Best way to get 1st record per partition: FIRST_VALUE vs ROW_NUMBER


I am looking for the fastest way to get the 1st record (columns a,b,c ) for every partition (a,b) using SQL. Table is ~10, 000, 000 rows.

Approach #1:

SELECT * FROM (
    SELECT a,b,c, 
    ROW_NUMBER() OVER ( PARTITION by a, b ORDER BY date DESC) as row_num
    FROM T 
) WHERE row_num =1

But it probably does extra work behind the scene - I need only 1st row per partition.

Approach #2 using FIRST_VALUE(). Since FIRST_VALUE() returns expression let pack/concatenate a,b,c using some separator into single expression, e.g.:

SELECT FIRST_VALUE(a+','+'b'+','+c) 
OVER ( PARTITION by a, b ORDER BY date  DESC rows unbounded preceding) FROM T

But in this case I need to unpack the result, which is extra step.

Approach #3 using FIRST_VALUE() - repeat OVER (...) for a , b :

SELECT 
FIRST_VALUE(a) 
OVER ( PARTITION by a, b ORDER BY date  DESC rows unbounded preceding),
FIRST_VALUE(b) 
OVER ( PARTITION by a, b ORDER BY date  DESC rows unbounded preceding),
c 
FROM T

In approach #3 I do not know if database engine (Redshift) smart enough to partition only once


Solution

  • The first query is different from the other two. The first only returns one row per group. The other two return the same rows as in the original query.

    You should use the version that does what you want, which I presume is the first one. If you add select distinct or group by to the other queries, that will probably add overhead that will make them slower -- but you can test on your data to see if that is true.

    Your intuition is correct that the first query does unnecessary work. In databases that support indexes fully, a correlated subquery is often faster. I don't think that would be the case in Redshift, however.