Search code examples
sqldatetimehiveimpalarolling-average

SQL Moving Average over specific time


I'm working with Impala and Hive, so I don't know if I have fancy functions that Oracle or MSQL provide

So , having a table of values and timestamps

 |-----------|------------------|
 |   value   |     timestamp    |
 |-----------|------------------|
 |     2     |    12:02:34      | 
 |-----------|------------------|
 |     5     |    12:06:30      |
 |-----------|------------------|
 |     2     |    13:01:33      |
 |-----------|------------------|
 |     2     |    13:04:00      |

I want to get an extra column with the rolling average for the last hour only

 |-----------|------------------|--------|
 |   value   |     timestamp    |  avg   |
 |-----------|------------------|--------|
 |     2     |    12:02:34      |   2    |
 |-----------|------------------|--------|
 |     5     |    12:06:30      |  3.5   |
 |-----------|------------------|--------|
 |     2     |    13:01:33      |   3    |
 |-----------|------------------|--------|
 |     2     |    13:04:00      |   3    |

There can be a variable number of values for each hour, so I cannot do a Window over a number of rows, because i don't know how many rows it implies.

Any suggestions?


Solution

  • This is a little tricky. Although Hive supports range window frames, it only supports them with numbers, not intervals.

    So, you need to convert the timestamp to a number and then use that:

    select t.*,
           avg(value) over (order by unix_timestamp(timestamp)
                            range between 3559 preceding and current row
                           )
    from t;
    

    3559 = 60 * 60 - 1, which is one second less than an hour. One second less is used because the window frame includes the current row.