Search code examples
hivewindowing

Hive Windowing : distinct results on partition


Hello I was learning WINDOWING functionality of Hive, and came across a problem.

I was trying to find the number of customer in a month:

my_table:

  • date_in_out: date of acquisition
  • rate_plan_name : string
  • stock: int
  • incomers: int

I do a partition on 3 variables: the year / month of acquisition and rate_plan


SELECT (first_value(stock) OVER w + sum(incomers) OVER w) AS stock_monthly,
year(date_in_out) AS year_in,
month(date_in_out) AS month_in,
rate_plan_name
FROM my_table
WINDOW w AS (PARTITION BY rate_plan_name, year(date_in_out), month(date_in_out) ORDER BY date_in_out ASC);

I got the result

enter image description here

I get different monthly_stock values, whereas year_in / month_in and rate_plan_name are the same in my dataset.

My question is why is this value different ? I would expect it to the same here.


Solution

  • With an order by date_in_out in the window specification, sum gets computed for every row. If you need it aggregated at a year month level, use

    WINDOW w AS (PARTITION BY rate_plan_name, year(date_in_out), month(date_in_out))
    

    But note that first_value still needs an order by.

    I think you are looking for,

    SELECT first_value(stock) OVER(w ORDER BY date_in_out) + sum(incomers) OVER w AS stock_monthly,
    year(date_in_out) AS year_in,
    month(date_in_out) AS month_in,
    rate_plan_name
    FROM my_table
    WINDOW w AS (PARTITION BY rate_plan_name, year(date_in_out), month(date_in_out))