Search code examples
hadoopmachine-learninghivetraining-data

Is it possible to do a 'normalized' dense_rank() in hive?


I have a consumer table like so.

consumer | product | quantity
-------- | ------- | --------
a        | x       | 3
a        | y       | 4
a        | z       | 1
b        | x       | 3
b        | y       | 5
c        | x       | 4

What I want is a 'normalized' rank assigned to each consumer so that I can split the table easily for testing and training. I used the dense_rank() in hive, so I got the below table.

rank | consumer | product | quantity
---- | -------- | ------- | --------
1    | a        | x       | 3
1    | a        | y       | 4
1    | a        | z       | 1
2    | b        | x       | 3
2    | b        | y       | 5
3    | c        | x       | 4

This is well and good, but I want to scale this to use with any number of consumers, so I would ideally like the range of ranks between 0 and 1, like so.

rank | consumer | product | quantity
---- | -------- | ------- | --------
0.33 | a        | x       | 3
0.33 | a        | y       | 4
0.33 | a        | z       | 1
0.67 | b        | x       | 3
0.67 | b        | y       | 5
1    | c        | x       | 4

This way, I'd always know what the range of ranks is, and can split the data in a standard way (rank <= 0.7 training, and rank > 0.7 testing)

Is there a way to achieve this in hive?

Or, is there a different and better approach to my original issue of splitting the data?

I tried to do a select * where rank < 0.7*max(rank), but hive says the MAX UDAF is not yet available in where clause.


Solution

  • percent_rank

    select  percent_rank() over (order by consumer) as pr
           ,* 
    
    from    mytable
    ;
    

    +-----+----------+---------+----------+
    | pr  | consumer | product | quantity |
    +-----+----------+---------+----------+
    | 0.0 | a        | z       |        1 |
    | 0.0 | a        | y       |        4 |
    | 0.0 | a        | x       |        3 |
    | 0.6 | b        | y       |        5 |
    | 0.6 | b        | x       |        3 |
    | 1.0 | c        | x       |        4 |
    +-----+----------+---------+----------+
    

    For filtering you'll need a sub-query / CTE

    select  *
    
    from   (select  percent_rank() over (order by consumer) as pr
                   ,* 
    
            from    mytable
            ) t
    
    where   pr <= ...
    ;