Hello I was learning WINDOWING functionality of Hive, and came across a problem.
I was trying to find the number of customer in a month:
my_table:
I do a partition on 3 variables: the year / month of acquisition and rate_plan
SELECT (first_value(stock) OVER w + sum(incomers) OVER w) AS stock_monthly,
year(date_in_out) AS year_in,
month(date_in_out) AS month_in,
rate_plan_name
FROM my_table
WINDOW w AS (PARTITION BY rate_plan_name, year(date_in_out), month(date_in_out) ORDER BY date_in_out ASC);
I got the result
I get different monthly_stock values, whereas year_in / month_in and rate_plan_name are the same in my dataset.
My question is why is this value different ? I would expect it to the same here.
With an order by date_in_out
in the window
specification, sum
gets computed for every row. If you need it aggregated at a year month level, use
WINDOW w AS (PARTITION BY rate_plan_name, year(date_in_out), month(date_in_out))
But note that first_value
still needs an order by
.
I think you are looking for,
SELECT first_value(stock) OVER(w ORDER BY date_in_out) + sum(incomers) OVER w AS stock_monthly,
year(date_in_out) AS year_in,
month(date_in_out) AS month_in,
rate_plan_name
FROM my_table
WINDOW w AS (PARTITION BY rate_plan_name, year(date_in_out), month(date_in_out))