Search code examples
sqlsql-servert-sql

Deduplicating IDs accros two columns


the goal is to depulicate the IDs in P1ID and P2ID.

I'm working on SQL Server 2019.

  • IDs in ID_GROUP are not shared accross groups, but IDs occur both in P1ID and P2ID within a group
  • ID is ROW_NUMBER() OVER(ORDER BY ID_GROUP, P1_RANK, P2_RANK
  • the first occurence of the ID should be used, the rest discarded, first according to column
  • the table DesiredOutcome displays the desired results

CREATE TABLE InitialResultSet (
    ID INT,
    ID_GROUP  VARCHAR(10),
    P1ID BIGINT,
    P2ID BIGINT,
    P1_RANK VARCHAR(10),
    P2_RANK VARCHAR(10)
);


INSERT INTO InitialResultSet (ID, ID_GROUP , P1ID, P2ID, P1_RANK , P2_RANK ) VALUES
(1, 'A', 1, 2, '0001', '0002'), 
(2, 'A', 2, 3, '0002', '0005'), 
(3, 'B', 4, 5, '0002', '0026'), 
(4, 'B', 6, 8, '0017', '0024'), 
(5, 'B', 6, 9, '0017', '0043'), 
(6, 'B', 6, 10, '0017', '0044'), 
(7, 'B', 8, 7, '0024', '0098'), 
(8, 'B', 9, 7, '0043', '0098'), 
(9, 'B', 10, 7, '0044', '0098');


CREATE TABLE DesiredOutcome (
    ID INT,
    ID_GROUP VARCHAR(10),
    P1ID BIGINT,
    P2ID BIGINT,
    P1_RANK VARCHAR(10),
    P2_RANK VARCHAR(10)
);

INSERT INTO DesiredOutcome (ID, ID_GROUP , P1ID, P2ID, P1_RANK, P2_RANK) VALUES
(1, 'A', 1, 2, '0001', '0002'),
(3, 'B', 4, 5, '0002', '0026'),
(4, 'B', 6, 8, '0017', '0024'),
(8, 'B', 9, 7, '0043', '0098');

InitialResultSet

ID ID_GROUP P1ID P2ID P1_RANK P2_RANK
1 A 1 2 0001 0002
2 A 2 3 0002 0005
3 B 4 5 0002 0026
4 B 6 8 0017 0024
5 B 6 9 0017 0043
6 B 6 10 0017 0044
7 B 8 7 0024 0098
8 B 9 7 0043 0098
9 B 10 7 0044 0098

DesiredOutcome

ID ID_GROUP P1ID P2ID P1_RANK P2_RANK
1 A 1 2 0001 0002
3 B 4 5 0002 0026
4 B 6 8 0017 0024
8 B 9 7 0043 0098

fiddle


Solution

  • Here's the way to do it recursively. This follows each ID_GROUP in order and determines which nodes are kept. It generates and uses a text string of ids to match for the ones that have already been retained:

    with dataX as (
        select *,
            row_number() over (partition by ID_GROUP order by ID) as gn,
            cast(concat('|', P1ID, '|', P2ID, '|') as varchar(max)) as payload
        from InitialResultSet
    ), chain as (
        select ID, ID_GROUP, gn, payload as allocated, 'Y' as keep
        from dataX where gn = 1
        union all
        select d.ID, d.ID_GROUP, d.gn,
            concat(c.allocated,
                case when allocated not like concat('%|', P1ID, '|%')
                      and allocated not like concat('%|', P2ID, '|%') then payload end),
            case when allocated not like concat('%|', P1ID, '%')
                  and allocated not like concat('%|', P2ID, '|%') then 'Y' else 'N' end
        from chain c inner join dataX d
            on d.ID_GROUP = c.ID_GROUP and d.gn = c.gn + 1
    )
    select * from InitialResultSet
    where ID in (select ID from chain where keep = 'Y');
    

    Based on OP's description of ID it would be possible to use that value instead of gn in order to follow the nodes in sequence. (You might possibly get a better plan by explicitly sorting on the same columns too.)

    Using gn does have the advantage in terms of recursion depth though. This query will never need more iterations than the size of the largest ID_GROUP whereas using ID will only add a single row on each pass and will involve significantly longer strings and corresponding scans. As a side benefit, this would also work under a different scenario where id values can be repeated across different ID_GROUPS.