Search code examples
hadoophiveuniquecalculated-columnsunique-key

add a new column for unique ID in hive table


i have a table in hive with two columns: session_id and duration_time like this:

|| session_id || duration||

    1               14          
    1               10      
    1               20          
    1               10          
    1               12          
    1               16          
    1               8       
    2               9           
    2               6           
    2               30          
    2               22

i want to add a new column with unique id when:

the session_id is changing or the duration_time > 15

i want the output to be like this:

session_id      duration    unique_id
1               14          1
1               10          1
1               20          2
1               10          2
1               12          2
1               16          3
1               8           3
2               9           4
2               6           4
2               30          5
2               22          6

any ideas how to do that in hive QL?

thanks!


Solution

  • SQL tables represent unordered sets. You need a column specifying the ordering of the values, because you seem to care about the ordering. This could be an id column or a created-at column, for instance.

    You can do this using a cumulative sum:

    select t.*,
           sum(case when duration > 15 or seqnum = 1 then 1 else 0 end) over
               (order by ??) as unique_id
    from (select t.*,
                 row_number() over (partition by session_id order by ??) as seqnum
          from t
         ) t;