Search code examples
sqldatetimewindow-functionsimpala

Sum column values over a window based on variable date range (impala)


Given a table as follows :

client_id   date            connections
---------------------------------------
121438297   2018-01-03      0
121438297   2018-01-08      1
121438297   2018-01-10      3
121438297   2018-01-12      1
121438297   2018-01-19      7
363863811   2018-01-18      0
363863811   2018-01-30      5
363863811   2018-02-01      4
363863811   2018-02-10      0

I am looking for an efficient way to sum the number of connections that occur within x number of days following the current row (the current row being included in the sum), partitioned by client_id.

If x=6 then the output table would result in :

client_id   date            connections     connections_within_6_days
---------------------------------------------------------------------
121438297   2018-01-03      0               1        
121438297   2018-01-08      1               5     
121438297   2018-01-10      3               4     
121438297   2018-01-12      1               1                       
121438297   2018-01-19      7               7
363863811   2018-01-18      0               0
363863811   2018-01-30      5               9
363863811   2018-02-01      4               4
363863811   2018-02-10      0               0

Concerns :

  1. I do not want to add all missing dates and then perform a sliding window counting the x following rows because my table is already extremely large.

  2. I am using Impala and the range between interval 'x' days following and current row is not supported.


Solution

  • The generic solution is a bit troublesome for multiple periods, but you can use multiple CTEs to support that. The idea is to "unpivot" the counts based on when they go in and out and then use a cumulative sum.

    So:

    with conn as (
          select client_id, date, connections
          from t
          union all
          select client_id, date + interval 7 day, -connections
          from t
         ),
         conn1 as (
          select client_id, date,
                 sum(sum(connections)) over (partition by client_id order by date) as connections_within_6_days
          from t
          group by client_id, date
         )
    select t.*, conn1. connections_within_6_days
    from t join
         conn1
         on conn1.client_id = t.client_id and
            conn1.date = t.date;