Search code examples
sqlsql-servercontiguous

Removing All but the first and last values by group when the group is repeated in MS SQL Server (contiguous)


We have a chat system that generates multiple event logs per second sometimes for every event during a chat. The issue is that these consume a massive amount of data storage (which is very expensive on that platform) and we'd like to streamline what we actually store and delete things that really aren't necessary.

To that end, there's an event type for what position in the queue the chat is. We don't care about each position as long as they are not intervening events for that chat. So we want to keep only the first and last in each distinct group where there were no other event types to just get "total time in queue" for that period.

To complicate this, a customer can go in and out of queue as they get transferred by department, so the SAME CHAT can have multiple blocks of these queue position records. I've tried using FIRST_VALUE and LAST_VALUE and it gets me most of the way there, but fails when we have the case of two distinct blocks of these events.

Here's the script to generate the test data:

<!-- language: lang-sql -->
    CREATE TABLE #testdata (
    id varchar(18),
    name varchar(8),
    [type] varchar(20),
    livechattranscriptid varchar(18),
    groupid varchar(40)) 

INSERT INTO #testdata (id,name,[type],livechattranscriptid,groupid) VALUES 
('0DZ14000003I2pOGAS','34128314','ChatRequest','57014000000ltfIAAQ','57014000000ltfIAAQChatRequest'),
('0DZ14000003IGmQGAW','34181980','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003IHbqGAG','34185171','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003ILuHGAW','34201743','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003IQ6cGAG','34217778','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003IR7JGAW','34221794','PushAssignment','57014000000ltfIAAQ','57014000000ltfIAAQPushAssignment'),
('0DZ14000003IiDnGAK','34287448','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003IiDoGAK','34287545','PushAssignment','57014000000ltfIAAQ','57014000000ltfIAAQPushAssignment'),
('0DZ14000003Iut5GAC','34336044','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
('0DZ14000003Iv7HGAS','34336906','Accept','57014000000ltfIAAQ','57014000000ltfIAAQAccept')

And here is the attempt to identify anything that was the first and last id for it's group ordered by the name field and grouped by the transcriptid:

select *,FIRST_VALUE(id) OVER(Partition BY groupid order by livechattranscriptid,name asc) as firstinstancegroup,
    LAST_VALUE(id) OVER(Partition BY groupid order by livechattranscriptid,name asc RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as lastinstancegroup from #testdata order by livechattranscriptid,name

The issue is, it gives me the same first and last id for ALL of them by that entire group rather than treating each group of Enqueue records as a distinct group. How would I treat each distinct grouping instance of Enqueue as a unique group?


Solution

  • Here's a similar solution Grouping contiguous table data

    not pretty but you will find the logic based from the OP. contiguous data over the same column

    declare @mytable table (
        id varchar(18),
        name varchar(8),
        [type] varchar(20),
        livechattranscriptid varchar(18),
        groupid varchar(100)) 
    
    INSERT INTO @mytable (id,name,[type],livechattranscriptid,groupid) VALUES 
    ('0DZ14000003I2pOGAS','34128314','ChatRequest','57014000000ltfIAAQ','57014000000ltfIAAQChatRequest'),
    ('0DZ14000003IGmQGAW','34181980','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
    ('0DZ14000003IHbqGAG','34185171','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
    ('0DZ14000003ILuHGAW','34201743','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
    ('0DZ14000003IQ6cGAG','34217778','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
    ('0DZ14000003IR7JGAW','34221794','PushAssignment','57014000000ltfIAAQ','57014000000ltfIAAQPushAssignment'),
    ('0DZ14000003IiDnGAK','34287448','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
    ('0DZ14000003IiDoGAK','34287545','PushAssignment','57014000000ltfIAAQ','57014000000ltfIAAQPushAssignment'),
    ('0DZ14000003Iut5GAC','34336044','Enqueue','57014000000ltfIAAQ','57014000000ltfIAAQEnqueue'),
    ('0DZ14000003Iv7HGAS','34336906','Accept','57014000000ltfIAAQ','57014000000ltfIAAQAccept')
    
    
    ;with myend as (   --- get all ends
     select 
     *
      from 
     (select 
       iif(groupid <> lead(groupid,1,groupid) over (order by name),
         id,
         'x') [newid],name
     from @mytable
     )x 
     where newid <> 'x'
     )
     , mystart as   -- get all starts
     (
     select 
      *
        from 
     (select 
       iif(groupid <> lag(groupid,1,groupid) over (order by name),
         id,
         'x') [newid], name,type,livechattranscriptid
     from @mytable
     )x 
     where newid <> 'x'
     )  ,
     finalstart as (   --- get all starts including the first row
    
      select id, 
        name,type,livechattranscriptid,
        row_number() over (order by name) rn
        from (
        select id,name,type,livechattranscriptid 
        from (
        select top 1 id, name,type,livechattranscriptid
        from @mytable
        order by name) x
        union all
        select newid,name,type,livechattranscriptid from mystart
        ) y
    
     ),
     finalend as   -- get all ends and add the last row
       (
    
      select id, 
        row_number() over (order by name) rn
        from (
        select id,name from (
        select top 1 id,name
        from @mytable
        order by name desc) x
        union all
        select newid,name from myend
        ) y
      )
    select 
      s.id [startid]
      ,s.name
      ,s.type
      ,s.livechattranscriptid
      ,e.id [lastid]
       from    
      finalend e
      inner join finalstart s 
         on   e.rn = s.rn    --- bind the two results over the positions or row number