I am learning Hive and wanted to write an optimized HiveQL/SQL query
My table looks like this:
CREATE TABLE sales (dealer VARCHAR(25), make VARCHAR(25), type VARCHAR(25), day INT);
INSERT INTO sales (dealer, make, type, day) VALUES
("Xyz", "Highlander", "SUV", "0"),
("Xyz", "Prius", "HATCH", "1"),
("Xyz", "Prius", "HATCH", "2"),
("Xyz", "Prius", "HATCH", "3"),
("Xyz", "Versa", "HATCH", "1"),
("Xyz", "Versa", "HATCH", "2"),
("Xyz", "Versa", "HATCH", "3"),
("Xyz", "S3", "SEDAN", "1"),
("Xyz", "S3", "SEDAN", "2"),
("Abc", "Forrester", "SUV", "1");
Given a "dealer" D, I want to compute the top N "make" for each "type" in the past X days, in a single query.
SELECT dealer, make, type, COUNT(*) AS frequency FROM sales
WHERE day > 0 AND dealer LIKE 'Xyz' GROUP BY make, type
ORDER BY frequency DESC LIMIT 5
The problem is when using GROUP BY on "make" and "type" for top 1, I will only get:
DEALER, MAKE, TYPE, COUNT
Xyz, Prius, Hatch, 3
Xyz, Versa, Hatch, 3
Xyz, S3, Sedan, 2
...
But I want
Xyz, Prius, Hatch, 3
Xyz, S3, Sedan, 2
...
for EACH "type" the top N.
Could someone help me understand how to write such a query?
SQL Fiddle http://sqlfiddle.com/#!2/df9304/5
****Update****
Seems like rank() would be useful:
Hive getting top n records in group by query
After reading some more docs and the hints from the linked questions:
SELECT dealer, make, rank, type FROM (
SELECT dealer, make, rank() OVER (PARTITION BY type ORDER BY count DESC) AS rank, type FROM (
SELECT dealer, make, count(*) AS count, type FROM Sales WHERE dealer = "Xyz" GROUP BY dealer, type, make
) CountedSales
) RankedSales
WHERE RankedSales.rank < 3;
Inner query does counting, middle query performs rank() and the outer query limits on rank.
Sales table contents
hive> select * from Sales;
OK
Xyz Highlander SUV NULL
Xyz Highlander SUV NULL
Xyz Rouge SUV NULL
Xyz Rouge SUV NULL
Xyz Prius HATCH NULL
Xyz Prius HATCH NULL
Xyz Prius HATCH NULL
Xyz Versa HATCH NULL
Xyz S3 SEDAN NULL
Xyz S3 SEDAN NULL
Xyz S3 SEDAN NULL
Xyz A8 SEDAN NULL
Xyz A8 SEDAN NULL
Xyz A8 SEDAN NULL
Xyz A8 SEDAN NULL
Time taken: 0.054 seconds, Fetched: 15 row(s)
Now the actual query.
hive> SELECT dealer, make, rank, type FROM (
> SELECT dealer, make, rank() OVER (PARTITION BY type ORDER BY count DESC) AS rank, type FROM (
> SELECT dealer, make, count(*) AS count, type FROM Sales WHERE dealer = "Xyz" GROUP BY dealer, type, make
> ) CountedSales
> ) RankedSales
> WHERE RankedSales.rank < 3;
...
Execution completed successfully
MapredLocal task succeeded
OK
Xyz Prius 1 HATCH
Xyz Versa 2 HATCH
Xyz A8 1 SEDAN
Xyz S3 2 SEDAN
Xyz Rouge 1 SUV
Xyz Highlander 1 SUV
Time taken: 28.491 seconds, Fetched: 6 row(s)