Search code examples
randomhivegroup-by

Select single random sample from group by in Hive


I have a table that looks like so:

Name      Age       Num_Hobbies     Num Shoes
Jane      31        10              2
Bob       23        3               4
Jane      60        2               200
Jane      31        100             6
Bob       10        8               7
etc etc

I would like to group this table by Name and Age, and at random pick one row from the rest of the columns.

In pandas, I would do the following:

df.groupby(['Name', 'Age']).apply(lambda x: x.sample(n=1))

In hive, I know how to create the group, but not how to choose a single random sample from group.

I saw this question on stack overflow: How to sample for each group in hive?

However, I do not understand how to apply Dynamic partitions or Hive bucketing to select a single sample from a group.


Solution

  • You can use rank() or row_number() with rand()

    select * from 
    (
           select name,age,rank() (partition by name,age order by rand()) as rank
           from table         
    ) t 
    where rank = 1