Search code examples
sqlsql-servert-sql

Removing duplicate number sets from a table


I am trying to find out how I can see in my table where the same set of numbers appears.

Eg. I have a table with 10 columns numbered 1 -10. each column has a different number from 1 - 10.

I want to see if there is the same number set in any rows.
Eg.
Row 1 : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Row 2 : 2, 3, 4, 5, 6, 7, 8, 9, 10, 1

These are different but they have the same numbers in them so I want to remove any number sets that are similar in this regard.


Solution

  • You can you can use a combination of CROSS APPLY, VALUES, and STRING_AGG to unpivot and aggregate the values into a single ordered comma-separated list that uniquely defines the set.

    From there, you can use a COUNT(*) window function to count the duplicates, ROW_NUMBER() to number the duplicates, and even DENSE_RANK() to number the distinct sets. One or more of these values can be used to filter the results as desired.

    WITH CombinedValues AS (
        SELECT
            D.*,
            VL.ValueList,
            DENSE_RANK() OVER(ORDER BY VL.ValueList) AS SetNumber,
            COUNT(*) OVER(PARTITION BY VL.ValueList) AS DupCount,
            ROW_NUMBER() OVER(PARTITION BY VL.ValueList ORDER BY D.Id) AS RowNum
        FROM Data D
        CROSS APPLY (
            SELECT STRING_AGG(V.Value, ',') WITHIN GROUP(ORDER BY V.Value) AS ValueList
            FROM (
                VALUES
                    (Col1), (Col2), (Col3), (Col4), (Col5),
                    (Col6), (Col7), (Col8), (Col9), (Col10)
            ) V(Value)
        ) VL
    )
    SELECT *
    FROM CombinedValues CV
    --WHERE CV.DupCount >= 2 -- Uncomment to only show the duplicates
    --WHERE CV.RowNum = 1    -- Uncomment to only show The first of each distinct set
    ORDER BY ValueList
    

    Sample results (with some extra test data):

    Id Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 ValueList Set
    Number
    Dup
    Count
    Row
    Num
    1 1 2 3 4 5 6 7 8 9 10 1,2,3,4,5,6,7,8,9,10 1 2 1
    2 2 3 4 5 6 7 8 9 10 1 1,2,3,4,5,6,7,8,9,10 1 2 2
    5 5 5 5 5 5 5 5 5 5 6 5,5,5,5,5,5,5,5,5,6 2 1 1
    3 5 6 5 6 5 6 5 6 5 6 5,5,5,5,5,6,6,6,6,6 3 2 1
    4 5 5 5 5 5 6 6 6 6 6 5,5,5,5,5,6,6,6,6,6 3 2 2

    If duplicate values are present within a set, two sets must have the same number of each value to be considered a match. If you only care about distinct values, you can modify the above query with:

    ...
        CROSS APPLY (
            -- Distinct values
            SELECT STRING_AGG(DV.Value, ',') WITHIN GROUP(ORDER BY DV.Value) AS ValueList
            FROM (
                SELECT DISTINCT Value
                FROM (
                    VALUES
                        (Col1), (Col2), (Col3), (Col4), (Col5),
                        (Col6), (Col7), (Col8), (Col9), (Col10)
                ) V(Value)
            ) DV
        ) VL
    ...
    

    Results:

    Id Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 ValueList Set
    Number
    Dup
    Count
    Row
    Num
    1 1 2 3 4 5 6 7 8 9 10 1,2,3,4,5,6,7,8,9,10 1 2 1
    2 2 3 4 5 6 7 8 9 10 1 1,2,3,4,5,6,7,8,9,10 1 2 2
    3 5 6 5 6 5 6 5 6 5 6 5,6 2 3 1
    4 5 5 5 5 5 6 6 6 6 6 5,6 2 3 2
    5 5 5 5 5 5 5 5 5 5 6 5,6 2 3 3

    See this db<>fiddle for a demo.