I have a table where each record have columns: title and category. I want to find 2 titles with most occurrences in their category. Some titles are listed in both categories. How can this be achieved in Hive?
Here is a table creation query:
create table book(category String, title String) row format delimited fields terminated by '\t' stored as textfile;
And example data:
fiction book1
fiction book2
fiction book3
fiction book4
fiction book5
fiction book6
fiction book7
fiction book8
fiction book8
fiction book8
psychology book1
psychology book2
psychology book2
psychology book2
psychology book2
psychology book7
psychology book7
psychology book7
Expected result:
fiction book8
fiction any other
psychology book2
psychology book7
Currently I've managed to write this query:
SELECT * FROM
(SELECT category, title,
count(*) as sale_count
from book
Group BY category, title) a
order by category, sale_count DESC;
That gives count for a title in each category but I can't find the way to return only 2 top records from each category
For only two top records use row_number()
select category, title, sale_count
from
(
SELECT a.*,
row_number() over(partition by category order by sale_count desc) rn
FROM
(SELECT category, title,
count(*) as sale_count
from book
Group BY category, title) a
)s where rn <=2
order by category, sale_count DESC;
and if there are more than one row with the same top sales and you need to return all top sales rows for two top counts, use DENSE_RANK
instead of row_number
, it will assign same rank if there are titles with the same sale_count.