I have a consumer table like so.
consumer | product | quantity
-------- | ------- | --------
a | x | 3
a | y | 4
a | z | 1
b | x | 3
b | y | 5
c | x | 4
What I want is a 'normalized' rank assigned to each consumer so that I can split the table easily for testing and training. I used the dense_rank() in hive, so I got the below table.
rank | consumer | product | quantity
---- | -------- | ------- | --------
1 | a | x | 3
1 | a | y | 4
1 | a | z | 1
2 | b | x | 3
2 | b | y | 5
3 | c | x | 4
This is well and good, but I want to scale this to use with any number of consumers, so I would ideally like the range of ranks between 0 and 1, like so.
rank | consumer | product | quantity
---- | -------- | ------- | --------
0.33 | a | x | 3
0.33 | a | y | 4
0.33 | a | z | 1
0.67 | b | x | 3
0.67 | b | y | 5
1 | c | x | 4
This way, I'd always know what the range of ranks is, and can split the data in a standard way (rank <= 0.7 training, and rank > 0.7 testing)
Is there a way to achieve this in hive?
Or, is there a different and better approach to my original issue of splitting the data?
I tried to do a select * where rank < 0.7*max(rank)
, but hive says the MAX UDAF is not yet available in where clause.
percent_rank
select percent_rank() over (order by consumer) as pr
,*
from mytable
;
+-----+----------+---------+----------+
| pr | consumer | product | quantity |
+-----+----------+---------+----------+
| 0.0 | a | z | 1 |
| 0.0 | a | y | 4 |
| 0.0 | a | x | 3 |
| 0.6 | b | y | 5 |
| 0.6 | b | x | 3 |
| 1.0 | c | x | 4 |
+-----+----------+---------+----------+
For filtering you'll need a sub-query / CTE
select *
from (select percent_rank() over (order by consumer) as pr
,*
from mytable
) t
where pr <= ...
;