Search code examples
sqlamazon-redshiftwindow-functions

Calc aggregates across continuous groups of values in Redshift


This is something that would probably be pretty easy to code a solution, but harder to accomplish in straight SQL. I may have to give up and code a routine that scans through the table.

I have a table of user status values with start and end dates that is like this:

create table #t (userid int4, status varchar(15), start_time date, end_time date);

insert into #t values
(1, 'Active', '2019-08-15', '2019-08-20'),
(1, 'Active', '2019-08-20', '2019-08-22'),
(1, 'Active', '2019-08-22', '2019-09-22'),
(1, 'Inactive', '2019-09-22', '2019-10-22'),
(1, 'At Risk', '2019-10-22', '2019-11-22'),
(1, 'Lapsed', '2019-11-22', '2019-12-08'),
(1, 'Active', '2019-12-08', '2019-12-18'),
(1, 'Active', '2019-12-18', '2020-01-11'),
(1, 'Active', '2020-01-11', '2020-01-15'),
(1, 'Active', '2020-01-15', '2020-02-15'),
(1, 'Inactive', '2020-02-15', '2020-03-15')
;

I'm trying to summarized to min/max dates for each continuous group of status values (when sorted by start_time), as shown below:

summarized status table

I've been trying to get there by using window functions in Redshift, but I cannot partition based on status as that seems to group the statuses together and I end up with "Active" from 2019-08-15 to 2020-02-15.


Solution

  • This is a so-called gaps-and-islands approach. Written on my phone so untested. But you should be able to search SO for that key-phrase.

    WITH
      sorted AS
    (
      SELECT
        *, 
        ROW_NUMBER()
          OVER (
            PARTITION BY userid
                ORDER BY start
          )
            AS row_userid_start,
        ROW_NUMBER()
          OVER (
            PARTITION BY userid, status
                ORDER BY start
          )
            AS row_userid_status_start
      FROM
        #t
    )
    SELECT
      userid,
      status,
      MIN(start)   AS start,
      MAX(end)     AS end
    FROM
      sorted
    GROUP BY
      userid,
      status,
      row_userid_status_start - row_userid_start