I have a dataset with booking hotels. date_in has format "yyyy-MM-dd". I need select top 10 the most visited hotel by month.
SELECT top_visits.date_ci, top_visits.hotel_id, top_visits.count_visits
FROM (
SELECT date_ci, hotel_id, COUNT(id) AS count_visits,
RANK() OVER (
PARTITION BY date_ci, hotel_id ORDER BY COUNT(id) DESC) as rank
FROM (
SELECT id, hotel_id, SUBSTRING(my_tab.date_in, 1, 7) as date_ci
FROM my_database.my_tab) x
) top_visits
GROUP BY date_ci, hotel_id HAVING rank <= 10;
I get the following error:
Error: Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 4:13 Expression not in GROUP BY key 'hotel_id'
Move COUNT(id) aggregation into the subquery, add group by:
SELECT top_visits.date_ci, top_visits.hotel_id, top_visits.count_visits
FROM (
SELECT date_ci, hotel_id, count_visits,
RANK() OVER (PARTITION BY date_ci, hotel_id ORDER BY count_visits DESC) as rank
FROM (
SELECT hotel_id, SUBSTRING(my_tab.date_in, 1, 7) as date_ci,
COUNT(id) AS count_visits
FROM my_database.my_tab
GROUP BY hotel_id, SUBSTRING(my_tab.date_in, 1, 7)
) x
) top_visits
WHERE rank <= 10;