mysql sql apache-spark-sql amazon-redshift-spectrum

Join the table with incremental data of the same table

I am trying to implement a logic in Redshift Spectrum where my original table looks like below:

Records in the student table:

1 || student1 || Boston   || 2019-01-01  
2 || student2 || New York || 2019-02-01
3 || student3 || Chicago  || 2019-03-01 
1 || student1 || Dallas   || 2019-03-01

Records in the incremental table studentinc looks like below:

1 || student1 || SFO       || 2019-04-01
4 || student4 || Detroit   || 2019-04-01

By joining both student and studentinc tables, I am trying to get the latest set of records which should look like below:

2 || student2 || New York  || 2019-02-01
3 || student3 || Chicago   || 2019-03-01
1 || student1 || SFO       || 2019-04-01
4 || student4 || Detroit   || 2019-04-01

I have got this solution by doing UNION of both student and studentinc, then querying the result of union based on max(modified_ts). However, this solution isn't good for huge tables, is there a better solution which works by joining both the tables?

Solution

I would recommend window functions:

select s.*
from (select s.*,
             row_number() over (partition by studentid order by date desc) as seqnum
      from ((select s.* from student
            ) union all
            (select i.* from incremental
             from incremental
            )
           ) s
     ) s
where seqnum = 1;

Note: The union all requires that the columns be exactly the same and in the same order. You may need to list out the columns if they are not the same.