Search code examples
sqlsql-serverapache-sparkpyspark

Remove redundant/duplicate and keep most complete unique records


I have table:

Row User Given Name Family Name Address ZIP
1 1 John Smith 1 Main Rd 1000
2 1 John 1 Main Rd 1000
3 1 John 1000
4 1 John 0800
5 1 Johnny 1000
6 2 Tom Lock 2 Main Rd 1001
7 2 Tommy 2 Main Rd 1001
8 2 Tom L 2 Main Rd 1001
9 2 1001
10 3 Oak 4 Main Rd 1004
11 3 Tim 4 Main Rd 1004

I want to be able to remove 'redundant' records - which basically just repeats the same information. In the example above row 2 and row 3 are basically just repeating row 1 (but with additional nulls/blanks) and row 9 is repeating row 6.

So the desired output looks like:

Row User Given Name Family Name Address ZIP
1 1 John Smith 1 Main Rd 1000
4 1 John 0800
5 1 Johnny 1000
6 2 Tom Lock 2 Main Rd 1001
7 2 Tommy 2 Main Rd 1001
8 2 Tom L 2 Main Rd 1001
10 3 Oak 4 Main Rd 1004
11 3 Tim 4 Main Rd 1004

Looking to do this in either SQL Server (2014)/Spark/Python.

My initial approach was to:

  • count of null values of every row
  • self join table on User and compare every field (e.g. given name, family name etc)
  • tag records where the comparisons are either all invalid (e.g. nulls) and exact
  • remove those records tagged

It's sort of like a brute force approach so wondering if there are any better / more efficient way of doing this?

With rows 10 and 11 I want to keep both as the given name and family name are unique and not something that's been repeated which I realized my tagging probably wouldn't work.

Here is the code to create the table:

CREATE TABLE user_list (
    rown INT,
    username INT,
    given_name VARCHAR(50),
    family_name VARCHAR(50),
    address VARCHAR(50),
    zip varchar(10)
)

INSERT INTO user_list (rown, username, given_name, family_name, address, zip)
VALUES (1,1,'John', 'Smith', '1 Main Rd', '1000'),
       (2,1,'John', NULL, '1 Main Rd', '1000'),
       (3,1,'John', NULL, NULL, '1000'),
       (4,1,'John', NULL, NULL, '0800'),
       (5,1,'Johnny', NULL, NULL, '1000'),
       (6,2,'Tom', 'Lock', '2 Main Rd', '1001'),
       (7,2,'Tommy', NULL, '2 Main Rd', '1001'),
       (8,2,'Tom', 'L', '2 Main Rd', '1001'),
       (9,2,NULL, NULL, NULL, '1001'),
       (10,3,NULL, 'Oak', '4 Main Rd', '1004'),
       (11,3,'Tim', NULL, '4 Main Rd', '1004');

Solution

  • The following will check for matches and exclude the lesser match. In case of an exact match (same value/null combination), the lower rnum row is retrained.

    WITH CTE_UserList AS (
        -- Calculate number of non-null, non-blank values for each row.
        SELECT
            UL.*,
            CASE WHEN UL.given_name        > '' THEN 1 ELSE 0 END
                + CASE WHEN UL.family_name > '' THEN 1 ELSE 0 END
                + CASE WHEN UL.address     > '' THEN 1 ELSE 0 END
                + CASE WHEN UL.zip         > '' THEN 1 ELSE 0 END
                AS ValueCount
        FROM user_list UL
    )
    SELECT *
    FROM CTE_UserList UL
    WHERE NOT EXISTS (
        -- Find equal or better match
        SELECT *
        FROM CTE_UserList UL2
        WHERE UL2.username = UL.username
        AND ( -- More values is a clear winner. If same values, pick lower rown
            UL2.ValueCount > UL.ValueCount
            OR (UL2.ValueCount = UL.ValueCount AND UL2.rown < UL.rown)
        )
        -- Match on equal or current row value null or blank
        AND (UL2.given_name  = UL.given_name  OR NULLIF(UL.given_name,  '') IS NULL)
        AND (UL2.family_name = UL.family_name OR NULLIF(UL.family_name, '') IS NULL)
        AND (UL2.address     = UL.address     OR NULLIF(UL.address,     '') IS NULL)
        AND (UL2.zip         = UL.zip         OR NULLIF(UL.zip,         '') IS NULL)
    )
    ORDER BY UL.username, UL.rown
    

    Results (with extra data before duplicate elimination)

    rown username given_name family_name address zip ValueCount CoveredByRow
    1 1 John Smith 1 Main Rd 1000 4 null
    2 1 John null 1 Main Rd 1000 3 1
    3 1 John null null 1000 2 1
    4 1 John null null 0800 2 null
    5 1 Johnny null null 1000 2 null
    6 2 Tom Lock 2 Main Rd 1001 4 null
    7 2 Tommy null 2 Main Rd 1001 3 null
    8 2 Tom L 2 Main Rd 1001 4 null
    9 2 null null null 1001 1 6
    10 3 null Oak 4 Main Rd 1004 3 null
    11 3 Tim null 4 Main Rd 1004 3 null
    12 4 null Beam 4 Main Rd 1004 3 14
    13 4 Jim null 4 Main Rd 1004 3 14
    14 4 Jim Beam 4 Main Rd 1004 4 null
    15 5 Exact Match null null 2 null
    16 5 Exact Match null null 2 15

    See this db<>fiddle with some extra data showing a few more partial and one exact match.