Search code examples
sqlsql-servergroup-byduplicatesdense-rank

Group rows together by multiple columns with or condition


I have a table of people that I need to have a way to find duplicate records by multiple possible scenarios. For instance, group together if fname, lname and address are the same OR if fname, lname, dob are the same, OR if fname, lname, and Email are the same. I can't figure out a way to do this in SQL. I am using the above examples solely as examples, because the grouping criteria will eventually be much more strict. I've set up an example with data in SQL Fiddle. My desired result would group records 2-5 together and 1 and 6 would be unique rows.

CREATE TABLE Persons (
    ID int IDENTITY(1,1),
    FirstName varchar(255),  
    LastName varchar(255),    
    Address1 varchar(255),
    City varchar(255),
    State varchar(255),
    BDay Varchar(255),
    Email Varchar(255)
);


INSERT INTO Persons
SELECT 'RICK', 'ALLEN', '44 Street', 'Minneapolis', 'MN', '1/2/1970','[email protected]'
UNION ALL
SELECT 'JENNIFER', 'ALLEN', '123 Street', 'Minneapolis', 'MN', '4/8/1980','[email protected]'
UNION ALL
SELECT 'JENNIFER', 'ALLEN', '123 Street', 'Minneapolis', 'MN', '4/8/1981','[email protected]'
UNION ALL
SELECT 'JENNIFER', 'ALLEN', '42 Street', 'Minneapolis', 'MN', '4/8/1980','[email protected]'
UNION ALL
SELECT 'JENNIFER', 'ALLEN', '123 Street', 'Minneapolis', 'MN', '4/8/1980','[email protected]'
UNION ALL
SELECT 'STEVEN', 'ALLEN', '555 Street', 'Minneapolis', 'MN', '2/8/1980','[email protected]'

Solution

  • You could use a not exists clause:

    select  p1.*
    from    Persons p1
    where   not exists
            (
            select  *
            from    Persons p2
            where   p1.id < p2.id and
                    p1.FirstName = p2.FirstName and
                    p1.LastName = p2.LastName and
                    (
                        p1.Address1 = p2.Address1 or
                        p1.BDay = p2.BDay or
                        p1.Email = p2.Email
                    )
            )
    

    Working example on SQL Fiddle.

    Replying to your comment, you can mark duplicates in the table with an update query:

    with    dupe as
            (
            select  min(p1.ID) as OriginalID
            ,       p2.ID as DupeID
            from    Persons p1
            join    Persons p2
            on      p1.id < p2.id and
                    p1.FirstName = p2.FirstName and
                    p1.LastName = p2.LastName and
                    (
                        p1.Address1 = p2.Address1 or
                        p1.BDay = p2.BDay or
                        p1.Email = p2.Email
                    )
            group by
                    p2.ID
            )
    update  p1
    set     DupeOfID = dupe.OriginalID
    from    Persons p1
    join    dupe
    on      dupe.DupeID = p1.ID
    

    Working example on SQL Fiddle.