Search code examples
sqldatetimewindow-functionsimpala

Sum column values over a window based on a week range (impala)


Given a table as follows :

client_id   date            connections
---------------------------------------
121438297   2018-01-03      0
121438297   2018-01-08      1
121438297   2018-01-10      3
121438297   2018-01-12      1
121438297   2018-01-19      7
363863811   2018-01-18      0
363863811   2018-01-30      5
363863811   2018-02-01      4
363863811   2018-02-10      0

I am looking for an efficient way to sum the number of connections that occur within 6 days following the current row (the current row being included in the sum), partitioned by client_id, which would result in :

client_id   date            connections     connections_within_6_days
---------------------------------------------------------------------
121438297   2018-01-03      0               1        
121438297   2018-01-08      1               5     
121438297   2018-01-10      3               4     
121438297   2018-01-12      1               1                       
121438297   2018-01-19      7               7
363863811   2018-01-18      0               0
363863811   2018-01-30      5               9
363863811   2018-02-01      4               4
363863811   2018-02-10      0               0

Issues :

  1. I do not want to add all missing dates and then perform a sliding window counting the 7 following rows because my table is already extremely large.

  2. I am using Impala and the range between interval '7' days following and current row is not supported.


Edit : I am looking for a generic answer taking into account the fact that I will need to change the window size to larger numbers (30+ days for example)


Solution

  • This answers the original version of the question.

    Impala doesn't fully support range between. Unfortunately, that doesn't leave many options. One is to use lag() with lots of explicit logic:

    select t.*,
           ( (case when lag(date, 6) over (partition by client_id order by date) = date - interval 6 day
                   then lag(connections, 6) over (partition by client_id order by date)
                   else 0
              end) +
             (case when lag(date, 5) over (partition by client_id order by date) = date - interval 6 day
                   then lag(connections, 5) over (partition by client_id order by date)
                   else 0
              end) +
             (case when lag(date, 4) over (partition by client_id order by date) = date - interval 6 day
                   then lag(connections, 4) over (partition by client_id order by date)
                   else 0
              end) +
             (case when lag(date, 3) over (partition by client_id order by date) = date - interval 6 day
                   then lag(connections, 3) over (partition by client_id order by date)
                   else 0
              end) +
             (case when lag(date, 2) over (partition by client_id order by date) = date - interval 6 day
                   then lag(connections, 2) over (partition by client_id order by date)
                   else 0
              end) +
             (case when lag(date, 1) over (partition by client_id order by date) = date - interval 6 day
                   then lag(connections, 1) over (partition by client_id order by date)
                   else 0
              end) +
             connections
            ) as connections_within_6_days         
    from t;
    

    Unfortunately, this doesn't generalize very well. If you want a wide range of days, you might want to ask another question.