Search code examples
sqlhadoophivehiveql

Identify the second max value in hive based on condition


I have a table with rows that looks like that with a column that rank all rows partition by ticket id over timestamp desc.

All rows can only have one flag equal to one.

ticketID  |  flag 1  | flag 2 | flag 3 | flag 4 | Timestamp  |  Rank    |  stringvalue  |  
----------------------------------------------------------------------------------------|
   1      |    0     |    0   |    1   |    0   |  xxxxxx    |    2     |   aaaaaa      |
   1      |    0     |    0   |    0   |    1   |  xxxxxx    |    1     |   bbbbbb      |
   1      |    0     |    1   |    0   |    0   |  xxxxxx    |    3     |   aaaaaa      |
   2      |    1     |    0   |    0   |    0   |  xxxxxx    |    2     |   bbbbbb      |
   2      |    0     |    0   |    0   |    1   |  xxxxxx    |    1     |   xxxxxx      |
   3      |    0     |    0   |    1   |    0   |  xxxxxx    |    4     |   aaaaaa      |
   3      |    0     |    1   |    0   |    0   |  xxxxxx    |    3     |   bbbbbb      |
   3      |    1     |    0   |    0   |    0   |  xxxxxx    |    1     |   ssssss      |
   3      |    0     |    0   |    0   |    1   |  xxxxxx    |    2     |   nnnnnn      |
   4      |    0     |    1   |    0   |    0   |  xxxxxx    |    2     |   gggggg      |
   4      |    0     |    0   |    0   |    1   |  xxxxxx    |    1     |   iiiiii      |

for each ticketID i need to get the first row based on the rank but with a an exception for a specific flag :

when the rank 1 of a ticket is a row with flag 4 = 1 then i need to take the second rank position as the first one. And if the second rank of the ticket is flag 3 = 1 then i need to concatenate stringvalue from the first rank (flag = 4) with the second rank (flag = 3).

If the second rank is flag = 1 or flag = 2 then just forget about the first rank and return the second one as the first.

I hope that my question is clear.

Thanks

Edit

Sample output :

----------------------------------------------------------------------------------------
ticketID  |  flag 1  | flag 2 | flag 3 | Timestamp  |  Rank    |  stringvalue          |  
---------------------------------------------------------------------------------------|
   1      |    0     |    0   |    1   |  xxxxxx    |    1     |   aaaaaa / bbbbbbb    |
   2      |    1     |    0   |    0   |  xxxxxx    |    1     |        bbbbbb         |
   3      |    1     |    0   |    0   |  xxxxxx    |    1     |        ssssss         |
   4      |    0     |    1   |    0   |  xxxxxx    |    1     |        gggggg         |
---------------------------------------------------------------------------------------- 

Solution

  • I'm going to use some sub-queries with a struct group by. This will allow us to ask questions about multiple rows without using a window. Likely will perform faster as we don't have to maintain window state.

    create table theRanks (ticketID int, flag_1 int, flag_2 int, flag_3 int, flag_4 int, Timestamp string, Rank int, stringvalue string)
    -- create some dummy data
    insert into theRanks values ( 1 , 0, 0, 1, 0, 'xxxxxx', 2, 'aaaaaa')
    insert into theRanks values ( 1 , 0, 0, 0, 1, 'xxxxxx', 1, 'bbbbbb')
    insert into theRanks values ( 1 , 0, 1, 0, 0, 'xxxxxx', 3, 'aaaaaa')
    
    with stuct_table as -- sub-query syntax
    ( 
      select 
       ticketID, 
       struct( -- struct will allow us to group rows together.
        Rank as rawRank, -- this has to be first in strut as we use it for sorting
        flag_1 , 
        flag_2, 
        flag_3, 
        flag_4 , 
        Timestamp , 
        stringvalue 
       ) as myRow 
     from 
      theRanks 
     where 
      rank in (1,2) -- only look at first two ranks
    ), 
    constants as -- subquery
    ( 
     select 0 as rank1, 1 as rank2 -- strictly not needed just to help make it more readable 
    ), 
    grouped_rows as --subquery
    (
     select 
      ticketID, 
      array_sort(collect_list(myRow)) as row_list  -- will sort on rank all structs into a list
     from stuct_table 
     group by ticketID
    ) , 
    raw_rows as (select --sub-query styntax
     ticketId, 
     case 
      when 
       row_list[constants.rank2].flag_1 + row_list[constants.rank2].flag_2 > 0 or (row_list[constants.rank1].flag_4 = 1 and row_list[constants.rank2].flag_3  = 0 )
     then
       row_list[constants.rank2]
     when 
       row_list[constants.rank1].flag_4 = 1 and row_list[constants.rank2].flag_3  = 1 -- condition to concat string
     then
       struct( -- this struct must match the original one we created
        row_list[constants.rank2].rawRank as rawRank, 
        row_list[constants.rank2].flag_1 as flag_1,
        row_list[constants.rank2].flag_2 as flag_2,
        row_list[constants.rank2].flag_3 as flag_3,
        row_list[constants.rank2].flag_4 as flag_4,
        row_list[constants.rank2].Timestamp as Timestamp,
        concat(
          row_list[constants.rank1].stringvalue, 
          ' / ', 
          row_list[constants.rank2].stringvalue) as stringvalue
        )
     else
       row_list[constants.rank1]
     end as rankedRow,
     1 as Rank
    from grouped_rows
    cross join constants) -- not strictly needed, just replace all constants.rank1 with 0 and constants.rank2 with 1.  I just use it to make it more clear what I'm doing.  Could be replaced in production.
    select rankedRow.* , 1 as Rank from raw_rows; -- makes struct columns into table columns