Search code examples
hivewindow-functions

Creating new column in Hive based on previous value of another column


I have data that looks like this:

Status Order
NULL   1
NULL   2
1      3
NULL   4
NULL   5
0      6
NULL   7
NULL   8
NULL   9
1      10

I am trying to add a new column that fills in the nulls with the previous number, such as:

Status     New
NULL       NULL
NULL       NULL
1          1
NULL       1
NULL       1
0          0
NULL       0
NULL       0
NULL       0
1          1

I am using hive, and do have access to pyspark, but was trying to solve through SQL if possible. I have looked into window functions (Rank, Dense_rank, Row_number, Lag, Lead). Any suggestions?


Solution

  • What you need are equivalence groups. Then you can use window functions to fill in the data. A group can be defined as the number of non-null values up to an including a given value.

    This assumes that you have a column that specifies the ordering.

    So:

    select t.*, max(status) over (partition by grp) as new_status
    from (select t.*, count(status) over (order by ?) as grp
          from t
         ) t;
    

    The ? is for the column that specifies the ordering.