I have table:
Row | User | Given Name | Family Name | Address | ZIP |
---|---|---|---|---|---|
1 | 1 | John | Smith | 1 Main Rd | 1000 |
2 | 1 | John | 1 Main Rd | 1000 | |
3 | 1 | John | 1000 | ||
4 | 1 | John | 0800 | ||
5 | 1 | Johnny | 1000 | ||
6 | 2 | Tom | Lock | 2 Main Rd | 1001 |
7 | 2 | Tommy | 2 Main Rd | 1001 | |
8 | 2 | Tom | L | 2 Main Rd | 1001 |
9 | 2 | 1001 | |||
10 | 3 | Oak | 4 Main Rd | 1004 | |
11 | 3 | Tim | 4 Main Rd | 1004 |
I want to be able to remove 'redundant' records - which basically just repeats the same information. In the example above row 2 and row 3 are basically just repeating row 1 (but with additional nulls/blanks) and row 9 is repeating row 6.
So the desired output looks like:
Row | User | Given Name | Family Name | Address | ZIP |
---|---|---|---|---|---|
1 | 1 | John | Smith | 1 Main Rd | 1000 |
4 | 1 | John | 0800 | ||
5 | 1 | Johnny | 1000 | ||
6 | 2 | Tom | Lock | 2 Main Rd | 1001 |
7 | 2 | Tommy | 2 Main Rd | 1001 | |
8 | 2 | Tom | L | 2 Main Rd | 1001 |
10 | 3 | Oak | 4 Main Rd | 1004 | |
11 | 3 | Tim | 4 Main Rd | 1004 |
Looking to do this in either SQL Server (2014)/Spark/Python.
My initial approach was to:
It's sort of like a brute force approach so wondering if there are any better / more efficient way of doing this?
With rows 10 and 11 I want to keep both as the given name and family name are unique and not something that's been repeated which I realized my tagging probably wouldn't work.
Here is the code to create the table:
CREATE TABLE user_list (
rown INT,
username INT,
given_name VARCHAR(50),
family_name VARCHAR(50),
address VARCHAR(50),
zip varchar(10)
)
INSERT INTO user_list (rown, username, given_name, family_name, address, zip)
VALUES (1,1,'John', 'Smith', '1 Main Rd', '1000'),
(2,1,'John', NULL, '1 Main Rd', '1000'),
(3,1,'John', NULL, NULL, '1000'),
(4,1,'John', NULL, NULL, '0800'),
(5,1,'Johnny', NULL, NULL, '1000'),
(6,2,'Tom', 'Lock', '2 Main Rd', '1001'),
(7,2,'Tommy', NULL, '2 Main Rd', '1001'),
(8,2,'Tom', 'L', '2 Main Rd', '1001'),
(9,2,NULL, NULL, NULL, '1001'),
(10,3,NULL, 'Oak', '4 Main Rd', '1004'),
(11,3,'Tim', NULL, '4 Main Rd', '1004');
The following will check for matches and exclude the lesser match. In case of an exact match (same value/null combination), the lower rnum row is retrained.
WITH CTE_UserList AS (
-- Calculate number of non-null, non-blank values for each row.
SELECT
UL.*,
CASE WHEN UL.given_name > '' THEN 1 ELSE 0 END
+ CASE WHEN UL.family_name > '' THEN 1 ELSE 0 END
+ CASE WHEN UL.address > '' THEN 1 ELSE 0 END
+ CASE WHEN UL.zip > '' THEN 1 ELSE 0 END
AS ValueCount
FROM user_list UL
)
SELECT *
FROM CTE_UserList UL
WHERE NOT EXISTS (
-- Find equal or better match
SELECT *
FROM CTE_UserList UL2
WHERE UL2.username = UL.username
AND ( -- More values is a clear winner. If same values, pick lower rown
UL2.ValueCount > UL.ValueCount
OR (UL2.ValueCount = UL.ValueCount AND UL2.rown < UL.rown)
)
-- Match on equal or current row value null or blank
AND (UL2.given_name = UL.given_name OR NULLIF(UL.given_name, '') IS NULL)
AND (UL2.family_name = UL.family_name OR NULLIF(UL.family_name, '') IS NULL)
AND (UL2.address = UL.address OR NULLIF(UL.address, '') IS NULL)
AND (UL2.zip = UL.zip OR NULLIF(UL.zip, '') IS NULL)
)
ORDER BY UL.username, UL.rown
Results (with extra data before duplicate elimination)
rown | username | given_name | family_name | address | zip | ValueCount | CoveredByRow |
---|---|---|---|---|---|---|---|
1 | 1 | John | Smith | 1 Main Rd | 1000 | 4 | null |
2 | 1 | John | null | 1 Main Rd | 1000 | 3 | 1 |
3 | 1 | John | null | null | 1000 | 2 | 1 |
4 | 1 | John | null | null | 0800 | 2 | null |
5 | 1 | Johnny | null | null | 1000 | 2 | null |
6 | 2 | Tom | Lock | 2 Main Rd | 1001 | 4 | null |
7 | 2 | Tommy | null | 2 Main Rd | 1001 | 3 | null |
8 | 2 | Tom | L | 2 Main Rd | 1001 | 4 | null |
9 | 2 | null | null | null | 1001 | 1 | 6 |
10 | 3 | null | Oak | 4 Main Rd | 1004 | 3 | null |
11 | 3 | Tim | null | 4 Main Rd | 1004 | 3 | null |
12 | 4 | null | Beam | 4 Main Rd | 1004 | 3 | 14 |
13 | 4 | Jim | null | 4 Main Rd | 1004 | 3 | 14 |
14 | 4 | Jim | Beam | 4 Main Rd | 1004 | 4 | null |
15 | 5 | Exact | Match | null | null | 2 | null |
16 | 5 | Exact | Match | null | null | 2 | 15 |
See this db<>fiddle with some extra data showing a few more partial and one exact match.