Search code examples
hivehiveqltop-n

Finding top n-th occurrences in group, Hive


I have a table where each record have columns: title and category. I want to find 2 titles with most occurrences in their category. Some titles are listed in both categories. How can this be achieved in Hive?

Here is a table creation query:

create table book(category String, title String) row format delimited fields terminated by '\t' stored as textfile;

And example data:

fiction book1
fiction book2
fiction book3
fiction book4
fiction book5
fiction book6
fiction book7
fiction book8
fiction book8
fiction book8
psychology  book1
psychology  book2
psychology  book2
psychology  book2
psychology  book2
psychology  book7
psychology  book7
psychology  book7

Expected result:

fiction book8
fiction any other
psychology  book2
psychology  book7

Currently I've managed to write this query:

SELECT * FROM  
(SELECT category, title,
             count(*) as sale_count
             from book
             Group BY category, title) a 
order by category, sale_count DESC; 

That gives count for a title in each category but I can't find the way to return only 2 top records from each category


Solution

  • For only two top records use row_number()

    select category, title, sale_count
    from
    (
    SELECT a.*,
    row_number() over(partition by category order by sale_count desc) rn
     FROM  
    (SELECT category, title,
                 count(*) as sale_count
                 from book
                 Group BY category, title) a 
    )s where rn <=2
    
    order by category, sale_count DESC; 
    

    and if there are more than one row with the same top sales and you need to return all top sales rows for two top counts, use DENSE_RANK instead of row_number, it will assign same rank if there are titles with the same sale_count.