I am trying to implement a logic in Redshift Spectrum where my original table looks like below:
Records in the student table:
1 || student1 || Boston || 2019-01-01
2 || student2 || New York || 2019-02-01
3 || student3 || Chicago || 2019-03-01
1 || student1 || Dallas || 2019-03-01
Records in the incremental table studentinc looks like below:
1 || student1 || SFO || 2019-04-01
4 || student4 || Detroit || 2019-04-01
By joining both student and studentinc tables, I am trying to get the latest set of records which should look like below:
2 || student2 || New York || 2019-02-01
3 || student3 || Chicago || 2019-03-01
1 || student1 || SFO || 2019-04-01
4 || student4 || Detroit || 2019-04-01
I have got this solution by doing UNION of both student and studentinc, then querying the result of union based on max(modified_ts). However, this solution isn't good for huge tables, is there a better solution which works by joining both the tables?
I would recommend window functions:
select s.*
from (select s.*,
row_number() over (partition by studentid order by date desc) as seqnum
from ((select s.* from student
) union all
(select i.* from incremental
from incremental
)
) s
) s
where seqnum = 1;
Note: The union all
requires that the columns be exactly the same and in the same order. You may need to list out the columns if they are not the same.