sql sql-server group-by duplicates dense-rank

Group rows together by multiple columns with or condition

I have a table of people that I need to have a way to find duplicate records by multiple possible scenarios. For instance, group together if fname, lname and address are the same OR if fname, lname, dob are the same, OR if fname, lname, and Email are the same. I can't figure out a way to do this in SQL. I am using the above examples solely as examples, because the grouping criteria will eventually be much more strict. I've set up an example with data in SQL Fiddle. My desired result would group records 2-5 together and 1 and 6 would be unique rows.

CREATE TABLE Persons (
    ID int IDENTITY(1,1),
    FirstName varchar(255),  
    LastName varchar(255),    
    Address1 varchar(255),
    City varchar(255),
    State varchar(255),
    BDay Varchar(255),
    Email Varchar(255)
);


INSERT INTO Persons
SELECT 'RICK', 'ALLEN', '44 Street', 'Minneapolis', 'MN', '1/2/1970','[email protected]'
UNION ALL
SELECT 'JENNIFER', 'ALLEN', '123 Street', 'Minneapolis', 'MN', '4/8/1980','[email protected]'
UNION ALL
SELECT 'JENNIFER', 'ALLEN', '123 Street', 'Minneapolis', 'MN', '4/8/1981','[email protected]'
UNION ALL
SELECT 'JENNIFER', 'ALLEN', '42 Street', 'Minneapolis', 'MN', '4/8/1980','[email protected]'
UNION ALL
SELECT 'JENNIFER', 'ALLEN', '123 Street', 'Minneapolis', 'MN', '4/8/1980','[email protected]'
UNION ALL
SELECT 'STEVEN', 'ALLEN', '555 Street', 'Minneapolis', 'MN', '2/8/1980','[email protected]'

Solution

You could use a not exists clause:

select  p1.*
from    Persons p1
where   not exists
        (
        select  *
        from    Persons p2
        where   p1.id < p2.id and
                p1.FirstName = p2.FirstName and
                p1.LastName = p2.LastName and
                (
                    p1.Address1 = p2.Address1 or
                    p1.BDay = p2.BDay or
                    p1.Email = p2.Email
                )
        )

Working example on SQL Fiddle.

Replying to your comment, you can mark duplicates in the table with an update query:

with    dupe as
        (
        select  min(p1.ID) as OriginalID
        ,       p2.ID as DupeID
        from    Persons p1
        join    Persons p2
        on      p1.id < p2.id and
                p1.FirstName = p2.FirstName and
                p1.LastName = p2.LastName and
                (
                    p1.Address1 = p2.Address1 or
                    p1.BDay = p2.BDay or
                    p1.Email = p2.Email
                )
        group by
                p2.ID
        )
update  p1
set     DupeOfID = dupe.OriginalID
from    Persons p1
join    dupe
on      dupe.DupeID = p1.ID

Working example on SQL Fiddle.