Search code examples
sqlapache-spark-sqlwindow-functions

How to get previous non equal salary record in the current row in spark sql


id startdate enddate salary
1 2015-09-07 9999-12-31 194000
1 2015-03-01 2015-09-06 194000
1 2014-04-10 2015-02-28 194000
1 2014-01-01 2014-04-09 192000
1 2013-07-31 2013-12-31 180000

This is the table I have, I need to introduce a new column which shows the previous salary for the employee. The challenge that I have is that if the salary is unchanged, it cannot be shown as the previous salary, so in this case the result table would be

id startdate enddate salary prevSalary
1 2015-09-07 9999-12-31 194000 192000
1 2015-03-01 2015-09-06 194000 192000
1 2014-04-10 2015-02-28 194000 192000
1 2014-01-01 2014-04-09 192000 180000
1 2013-07-31 2013-12-31 180000 null (or) 0

I tried using the "lag" operator but it doesn't give me the desired output, instead it just picks the salary from the last record which is incorrect.

My query:

select *, lag(salary) 
    over(partition by id, order by startdate) as prevSalary 
    from tablename.

I also tried partitioning it with "id" and "salary" but I am not able to formulate the proper solution.

Note, there are multiple ids in the table, I am using one id just to give an example. Also the date records are consistent and have no gaps.


Solution

  • Once you selected the previous salary with LAG, in the same partition of salary, there will be the first one that will hold the good result, and the other ones for which salary = lastsalary. You can just apply the FIRST_VALUE window function in the very same salary partition, to overwrite the needed lastsalary.

    WITH cte AS (
        SELECT *, LAG(salary) OVER(PARTITION BY id ORDER BY startdate) AS lastsalary
        FROM tab
    )
    SELECT id, startdate, enddate, salary, 
           FIRST_VALUE(lastsalary) OVER(PARTITION BY id, salary ORDER BY startdate ROWS UNBOUNDED PRECEDING) AS lastsalary 
    FROM cte
    

    Output:

    id startdate enddate salary lastsalary
    1 2015-09-07 9999-12-31 194000 192000
    1 2015-03-01 2015-09-06 194000 192000
    1 2014-04-10 2015-02-28 194000 192000
    1 2014-01-01 2014-04-09 192000 180000
    1 2013-07-31 2013-12-31 180000 null

    Edit: On ahmed's smart suggestion, if salary is not monotonically increasing, and the id happen to have a previously occurring salary, you could need a solution that works in gaps-and-islands setting, for which you would need to rebuild your partitioning as follows:

    WITH gaps AS (
        SELECT *,
               ROW_NUMBER() OVER(PARTITION BY id ORDER BY startdate DESC) - 
                 ROW_NUMBER() OVER(PARTITION BY id, salary ORDER BY startdate DESC) grp
        FROM tab
    ), cte AS (
        SELECT *, LAG(salary) OVER(PARTITION BY id ORDER BY startdate) AS lastsalary
        FROM gaps
    )
    SELECT *,
           FIRST_VALUE(lastsalary) OVER(PARTITION BY id, grp ORDER BY startdate ROWS UNBOUNDED PRECEDING) AS lastsalary 
    FROM cte 
    ORDER BY startdate DESC