Given a Hive table like below I want to calculate a weighted moving average. In table 1 is an example of such a weighted moving average. The value of every column is
col_value = (1 * n) + (0.75 * n-1) + (0.5 * n-2) + (0.25 * n-3)
where n is the value at the current row, n-1 the value at the above row, etc.
More info on weighted moving averages on the corresponding Wikipedia section
Im stuck after calculating the moving average like so:
Code sample 1: Query so far
SELECT
*,
AVG (value) OVER (
ORDER BY
id
ROWS BETWEEN
3 PRECEDING AND CURRENT ROW
FROM
table
Table 1:
id value weighted_moving_average code_sample_1
...
11 0 0 0
12 1 1 0.25
13 0 0.75 0.25
14 0 0.5 0.25
15 0 0.25 0.25
16 0 0 0
...
If you want the window size to be easily changed, then the implementation would be substantially different.
(Please note that I developed this in sql-server because I am not aware of any website I can test hive SQL. You may need to tweak some syntax.)
Declare @thisManyRows int = 3;
With rowNumber_cte As (
Select
id,
[Value],
RowNo = Row_Number() Over (Order By id)
From Table1),
windows_cte As (
Select
b_id = b.id,
j_id = j.id,
j.[Value],
RowNo2 = @thisManyRows + 1 - Row_Number() Over (Partition By b.id Order By j.id desc),
n = Count(j.[Value]) Over (Partition By b.id)
From rowNumber_cte As b
Join rowNumber_cte As j On j.RowNo Between b.RowNo - @thisManyRows + 1 And b.RowNo)
Select
id = b_id,
[Value] = Sum(Case When b_id = j_id Then [Value] Else Null End),
WeightedAverage = Sum([Value] * 1.0 * (RowNo2 *1.0 / @thisManyRows)) / Min(n)
From windows_cte
Group By b_id;
I'm going to explain what this does because it's a bit cryptic:
rowNumber_cte
we assign a row number, we do this because we don't assume that id
will be sequential.windows_cte
we join the rowNumber_cte
on itself, so that each row is joined to all the rows that make up it's window of calculation (with reference to the @thisManyRows
variable. This cte also introduces a new RowNo2
field which will basically end up being the numerator of the fraction for which this row's value contributes to the weighted moving average (i.e. the current row, if the window is 4 will have RowNo2 = 4
and will have 4/4 weight).windows_cte
also calculates the number of rows in the window to better handle rows at the start of the data which don't have a full window.id
, multiplies the [Value]
of each row in the window by it's weight and then divides by the number of rows in the window: n
.