Search code examples
sqlnetezzagaps-and-islands

Creating groupings based on criteria


Hoping to get some assistance with the following gaps and islands problem

I have the following sample dataset:

MEM_ID CLM_ID ADM_DT DCHG_DT PROV
1 111 01-01-2021 01-01-2021 1
1 112 01-01-2021 02-01-2021 1
1 113 01-01-2021 01-01-2021 1
1 114 01-01-2021 01-01-2021 1
1 115 01-01-2021 01-01-2021 1
1 116 02-15-2021 02-15-2021 2
1 117 02-15-2021 02-15-2021 3
1 118 02-16-2021 02-16-2021 3
2 211 01-01-2021 01-01-2021 1
2 212 03-01-2021 03-01-2021 2
3 311 02-01-2021 02-01-2021 1
3 312 02-01-2021 02-01-2021 2

What I am trying to accomplish is the following:

Based on the unique combination of a MEM_ID,PROV,and ADM_DT, I want to group the events into 1. I also need to group any additional events from the same MEM_ID and PROV if the DCHG_DT is the same as the ADM_DT or up to 1 day ahead.

The end result should look like this:

MEM_ID CLM_ID ADM_DT DCHG_DT PROV GROUP
1 111 01-01-2021 01-01-2021 1 1
1 112 01-01-2021 02-01-2021 1 1
1 113 01-01-2021 01-01-2021 1 1
1 114 01-01-2021 01-01-2021 1 1
1 115 01-01-2021 01-01-2021 1 1
1 116 02-15-2021 02-15-2021 2 2
1 117 02-15-2021 02-15-2021 3 3
1 118 02-16-2021 02-16-2021 3 3
2 211 01-01-2021 01-01-2021 1 1
2 212 03-01-2021 03-01-2021 2 2
3 311 02-01-2021 02-01-2021 1 1
3 312 02-01-2021 02-01-2021 2 2

In the expected output, the first 5 records for MEM_ID 1 was grouped into group 1 because it meets the criteria of having the same MEM_ID,PROV, and ADM_DT combination. Records 6 and 7 have the same MEM_ID and ADM_DT combination but different PROV, therefore they are broken into groups 2 and 3. Record 8 has the same combination of MEM_ID and PROV but the ADM_DT is different to that of Record 7. However because the ADM_DT from record 8 is within 1 day of the DCHG_DT from record 7, it is considered the same event and assigned group 3. The rest of the data points are pretty straight forward on why they are grouped the way they are.

I have attempted the following code:


SELECT DISTINCT MEM_ID
    , PROV
    , CLM_ID
    , ADM_DT
    , DCHG_DT
    , sum(ISSTART) OVER (PARTITION BY MEM_ID, ADM_DT ORDER BY ADM_DT, DCHG_DT rows unbounded preceding) AS GROUP

FROM (
    SELECT DISTINCT MEM_ID
        , PROV
        , CLM_ID
        , ADM_DT
        , DCHG_DT
        , CASE 
            WHEN PROV = LAG(PROV) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT)
                AND (
                        (
                        ADM_DT = LAG(DCHG_DT) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT)
                        )
                    OR (
                        ADM_DT = LAG(DCHG_DT + 1) OVER (PARTITION BY MEM_ID ORDER BY ADM_DT, DCHG_DT)
                        )
                    )
                THEN 0
            ELSE 1
            END AS ISSTART
    
    FROM c1
    ) t

However this is giving me errors in grouping.

Using the above code i get something similar to this:

MEM_ID CLM_ID ADM_DT DCHG_DT PROV GROUP
1 111 01-01-2021 01-01-2021 1 0
1 112 01-01-2021 02-01-2021 1 1
1 113 01-01-2021 01-01-2021 1 1
1 114 01-01-2021 01-01-2021 1 1
1 115 01-01-2021 01-01-2021 1 1
1 116 02-15-2021 02-15-2021 2 0
1 117 02-15-2021 02-15-2021 3 0
1 118 02-16-2021 02-16-2021 3 1

Any help is appreciated!

Thank you!


Solution

  • select *,
        sum(gap) over (partition by MEM_ID order by ADM_DT, CLM_ID) as grp
    from (
        select *,
            case when
                    ADM_DT =
                    lag(ADM_DT)  over (partition by MEM_ID, PROV order by ADM_DT, CLM_ID)
                    or           
                days_between(
                    ADM_DT,
                    lag(DCHG_DT) over (partition by MEM_ID, PROV order by ADM_DT, CLM_ID)
                ) in (0, 1) -- not sure if zero is valid
                then 0 else 1 end as gap
        from c1
    ) t
    order by MEM_ID, PROV, GRP, ADM_DT
    

    If you don't have days_between() available then apparently you would just use when lag(DCHG_DT) over (partition by MEM_ID, PROV order by ADM_DT, CLM_ID) - ADM_DT

    Note that Postgres handles the date math differently but you can try it out https://dbfiddle.uk/?rdbms=postgres_13&fiddle=95f28f3d57c520d1e45422b588dcdd85