Search code examples
sqlsql-servergaps-and-islandsgaps-in-data

Gaps and island fails with 3 columns using SQL Server


I have come across a strange behavior with the gaps and island solution. With 3 columns (3rd column being non integer), the result is random really. Let's suppose we the following query:

Declare @Table1 TABLE
(
    ID varchar(50), 
    yr float, 
    CO1 varchar(50)
);

INSERT INTO @Table1 (ID, yr, CO1)
VALUES ('I2','2011','ABE'), ('I2','2012','ABE'), ('I2','2013','ABE'),
       ('I2','2014','ABE'), ('I2','2014','ABE'), ('I2','2005','ABD'),
       ('I2','2006','ABD'), ('I2','2007','ABD'), ('I2','2008','ABD'),
       ('I2','2007','ABA CD'), ('I2','2011','ABA CD'), ('I2','2013','ABA CD');

SELECT 
    ID, CO1, StartSeqNo = MIN(yr), EndSeqNo = MAX(yr)
FROM 
    (SELECT 
         ID, yr, CO1,
         rn = yr - ROW_NUMBER() OVER (PARTITION BY ID ORDER BY yr)
     FROM 
         @Table1) a
GROUP BY 
    ID, CO1, rn ;

The result I am aiming for is :

ID  CO1    StartSeqNo   EndSeqNo
----------------------------
I2  ABA CD    2007       2007
I2  ABA CD    2011       2011
I2  ABA CD    2013       2013
I2  ABD       2005       2008
I2  ABE       2011       2014

I have looked through stackoverflow and elsewhere to determine if I was missing something. I already tried with distinct and dense_rank, neither gives the proper result

Here are the distinct and dense_rank queries I've already tried:

--- distinct 

SELECT distinct ID,CO1, StartSeqNo=MIN(yr), EndSeqNo=MAX(yr)
FROM (
    SELECT distinct ID, yr, CO1
        ,rn=yr-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY yr)
    FROM @Table1) a
GROUP BY ID, CO1, rn ;

--- with dense_rank
SELECT ID,CO1, StartSeqNo=MIN(yr), EndSeqNo=MAX(yr)
FROM (
    SELECT ID, yr, CO1
        ,rn=yr-dense_rank() OVER (PARTITION BY ID ORDER BY yr)
    FROM @Table1) a
GROUP BY ID, CO1, rn ;

I dont see why the gaps and island query would not work with having a non-integer column. I reckon there is an issue with grouping somewhere. Please help me with this.

Sim


Solution

  • With no gaps, the years would be a sequential numbering in each ID/CO1 group that you can compare to a no-gap numbering which of course also must be sequential for each ID/CO1 ordered by year. So, if you don't ORDER BY CO1 (before year), you must also use CO1 to PARTITION BY in the row numbering function. Also, your data contains duplicate rows, so to give equal years in an ID/CO1-group the same number, use the RANK function instead of ROW_NUMBER:

    WITH a (ID, CO1, yr, nmbr) AS (
      SELECT ID, CO1, yr
        , yr - RANK() OVER (PARTITION BY ID, CO1 ORDER BY yr)
      FROM @Table1
    )
    SELECT ID, CO1, StartSeqNo = MIN(yr), EndSeqNo = MAX(yr)
    FROM a
    GROUP BY ID, CO1, nmbr;
    

    At last let me suggest to use int instead of float for year numbers.