Search code examples
sqlsql-servergroup-byrow-number

SQL Server - group by with row number - Gaps and Islands


I have a table with data like this

+-------------+--------------+------------+----------------+
| CustomerSID | StartDateSID | EndDateSID | MarketingOptIn |
+-------------+--------------+------------+----------------+
|       12345 |     20101019 |   20131016 | Y              |
|       12345 |     20131017 |   20140413 | Y              |
|       12345 |     20140414 |   20140817 | N              |
|       12345 |     20140818 |   20141228 | N              |
|       12345 |     20141229 |   20150125 | Y              |
|       12345 |     20150126 |          0 | Y              |
+-------------+--------------+------------+----------------+

I need to create a view on top of this table to have data formatted in the below format for the Flag, basically the duration for which the Flag was Y or N. (EndDateSID - 0 is currently active, so today's date)

+-------------+--------------+------------+----------------+
| CustomerSID | StartDateSID | EndDateSID | MarketingOptIn |
+-------------+--------------+------------+----------------+
|       12345 |     20101019 |   20140413 | Y              |
|       12345 |     20140414 |   20141228 | N              |
|       12345 |     20141229 |   20150825 | Y              |
+-------------+--------------+------------+----------------+

Most customers only have a change in their Flag once, hence below query works:

SELECT 
CH1.CustomerSID
,MIN(CH1.StartDateSID) StartDate
,MAX(ISNULL(NULLIF(CH1.EndDateSID,0),CONVERT(INT, CONVERT(VARCHAR, GETDATE(), 112)))) EndDate
,CH1.MarketingOptIn 
FROM DWH.DimCustomerHistory CH1
GROUP BY CH1.CustomerSID, CH1.MarketingOptIn
ORDER BY CH1.CustomerSID, CH1.MarketingOptIn

How can I achieve the intended output for customers like the one above, having changes in the flag more than once?


Solution

  • This is a gaps and islands problem. You need to use ROW_NUMBER() to identify your gaps, so the start stage would be:

    SELECT  CustomerSID,
            StartDateSID,
            EndDateSID,
            MarketingOptIn,
            TotalRowNum = ROW_NUMBER() OVER(PARTITION BY CustomerSID ORDER BY StartDateSID),
            RowNumInGroup = ROW_NUMBER() OVER(PARTITION BY CustomerSID, MarketingOptIn ORDER BY StartDateSID),
            GroupID = ROW_NUMBER() OVER(PARTITION BY CustomerSID ORDER BY StartDateSID) -   
                    ROW_NUMBER() OVER(PARTITION BY CustomerSID, MarketingOptIn ORDER BY StartDateSID)
    FROM    dbo.YourTable;
    

    Output:

    CustomerSID StartDateSID    EndDateSID  MarketingOptIn  TotalRowNum RowNumInGroup   GroupID
    ---------------------------------------------------------------------------------------------------
    12345       20101019        20131016    Y               1           1               0
    12345       20131017        20140413    Y               2           2               0
    12345       20140414        20140817    N               3           1               2
    12345       20140818        20141228    N               4           2               2
    12345       20141229        20150125    Y               5           3               2
    12345       20150126        0           Y               6           4               2
    

    The key here is that by taking the row number of each row, and also the row number of each row withing the group, you can get a unique identifier (GroupID + MarketingOptIn) which identifies each of your islands. Then it is just a case of grouping by this identifier when doing your aggregates:

    FULL WORKING EXAMPLE

    DECLARE @T TABLE 
    (   
        CustomerSID INT,
        StartDateSID INT,
        EndDateSID INT,
        MarketingOptIn CHAR(1)
    )
    INSERT @T 
    VALUES
        (12345, 20101019, 20131016, 'Y'),
        (12345, 20131017, 20140413, 'Y'),
        (12345, 20140414, 20140817, 'N'),
        (12345, 20140818, 20141228, 'N'),
        (12345, 20141229, 20150125, 'Y'),
        (12345, 20150126, 0, 'Y');
    
    
    WITH CTE AS
    (
        SELECT  CustomerSID,
                StartDateSID,
                EndDateSID,
                MarketingOptIn,
                GroupID = ROW_NUMBER() OVER(PARTITION BY CustomerSID ORDER BY StartDateSID) -   
                        ROW_NUMBER() OVER(PARTITION BY CustomerSID, MarketingOptIn ORDER BY StartDateSID)
        FROM    @T
    )
    SELECT  CustomerSID, 
            StartDateSID = MIN(StartDateSID),
            EndDateSID = CASE WHEN MIN(EndDateSID) = 0 THEN CONVERT(INT, CONVERT(VARCHAR(8), GETDATE(), 112)) ELSE MAX(EndDateSID) END,
            MarketingOptIn
    FROM    CTE
    GROUP BY CustomerSID, MarketingOptIn, GroupID
    ORDER BY CustomerSID, StartDateSID;