
SQL (Redshift) get start and end values for consecutive data in a given column


I have a table that records the subscription state of each user on any given day. The data looks like this:

+------------+------------+--------------+
| account_id |    date    | current_plan |
+------------+------------+--------------+
| 1          | 2019-08-01 | free         |
| 1          | 2019-08-02 | free         |
| 1          | 2019-08-03 | yearly       |
| 1          | 2019-08-04 | yearly       |
| 1          | 2019-08-05 | yearly       |
| ...        |            |              |
| 1          | 2020-08-02 | yearly       |
| 1          | 2020-08-03 | free         |
| 2          | 2019-08-01 | monthly      |
| 2          | 2019-08-02 | monthly      |
| ...        |            |              |
| 2          | 2019-08-31 | monthly      |
| 2          | 2019-09-01 | free         |
| ...        |            |              |
| 2          | 2019-11-26 | free         |
| 2          | 2019-11-27 | monthly      |
| ...        |            |              |
| 2          | 2019-12-27 | monthly      |
| 2          | 2019-12-28 | free         |
+------------+------------+--------------+

I would like to have a table that gives the start and end dates of each subscription. It would look something like this:

+------------+------------+------------+-------------------+
| account_id | start_date |  end_date  | subscription_type |
+------------+------------+------------+-------------------+
|          1 | 2019-08-03 | 2020-08-02 | yearly            |
|          2 | 2019-08-01 | 2019-08-31 | monthly           |
|          2 | 2019-11-27 | 2019-12-27 | monthly           |
+------------+------------+------------+-------------------+

I started with a LAG window function and a bunch of WHERE conditions to grab the "state changes", but this makes it difficult to see when customers float in and out of subscriptions, and I'm not sure this is the best method.

with lag as (
    -- previous day's plan and date for each account
    select *, LAG(current_plan) OVER (PARTITION BY account_id ORDER BY date ASC) AS previous_plan
            , LAG(date) OVER (PARTITION BY account_id ORDER BY date ASC) AS previous_plan_date
    from data
)
SELECT *
FROM lag
-- rows where a paid plan has just dropped back to free
where (current_plan = 'free' and previous_plan in ('monthly', 'yearly'))

Solution

  • This is a gaps-and-islands problem. I think a difference of row numbers works:

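    select account_id, current_plan, min(date) as start_date, max(date) as end_date
    from (select d.*,
                 -- position of the row among all of the account's rows
                 row_number() over (partition by account_id order by date) as seqnum,
                 -- position of the row among the account's rows with the same plan
                 row_number() over (partition by account_id, current_plan order by date) as seqnum_2
          from data d
         ) d
    where current_plan <> 'free'
    -- the difference is constant within each run of consecutive rows on the same plan
    group by account_id, current_plan, (seqnum - seqnum_2);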
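
To see why the difference of row numbers identifies each run, it may help to look at the two counters side by side. The query below is only an illustrative sketch: the six rows are made-up sample data built inline with `union all` (they are not taken from the table in the question), and `grp` is a hypothetical name for the difference.

```sql
-- Hypothetical sample rows for one account, built inline so the query is self-contained.
with data as (
    select 2 as account_id, cast('2019-08-30' as date) as date, 'monthly' as current_plan
    union all select 2, cast('2019-08-31' as date), 'monthly'
    union all select 2, cast('2019-09-01' as date), 'free'
    union all select 2, cast('2019-09-02' as date), 'free'
    union all select 2, cast('2019-09-03' as date), 'monthly'
    union all select 2, cast('2019-09-04' as date), 'monthly'
)
select account_id, date, current_plan,
       -- counts every row of the account in date order
       row_number() over (partition by account_id order by date) as seqnum,
       -- counts only the account's rows that share the same plan
       row_number() over (partition by account_id, current_plan order by date) as seqnum_2,
       -- stays constant while the plan does not change
       row_number() over (partition by account_id order by date)
         - row_number() over (partition by account_id, current_plan order by date) as grp
from data
order by account_id, date;
```

With these sample rows, the first two monthly rows get `grp = 0` and the last two monthly rows get `grp = 2`, so the two monthly periods fall into separate groups. The free rows also get `grp = 2`, which is why `current_plan` has to stay in the `group by` alongside the difference. Note that this approach relies on there being one row per account per day; it numbers rows rather than comparing dates, so a missing day would not by itself split a run.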