Search code examples
mysqlsqlapache-spark-sqlamazon-redshift-spectrum

Join the table with incremental data of the same table


I am trying to implement a logic in Redshift Spectrum where my original table looks like below:

Records in the student table:

1 || student1 || Boston   || 2019-01-01  
2 || student2 || New York || 2019-02-01
3 || student3 || Chicago  || 2019-03-01 
1 || student1 || Dallas   || 2019-03-01

Records in the incremental table studentinc looks like below:

1 || student1 || SFO       || 2019-04-01
4 || student4 || Detroit   || 2019-04-01

By joining both student and studentinc tables, I am trying to get the latest set of records which should look like below:

2 || student2 || New York  || 2019-02-01
3 || student3 || Chicago   || 2019-03-01
1 || student1 || SFO       || 2019-04-01
4 || student4 || Detroit   || 2019-04-01

I have got this solution by doing UNION of both student and studentinc, then querying the result of union based on max(modified_ts). However, this solution isn't good for huge tables, is there a better solution which works by joining both the tables?


Solution

  • I would recommend window functions:

    select s.*
    from (select s.*,
                 row_number() over (partition by studentid order by date desc) as seqnum
          from ((select s.* from student
                ) union all
                (select i.* from incremental
                 from incremental
                )
               ) s
         ) s
    where seqnum = 1;
    

    Note: The union all requires that the columns be exactly the same and in the same order. You may need to list out the columns if they are not the same.