Search code examples
sqlgoogle-bigqueryduplicatesaggregate-functions

How to find duplicate records by full name and similar DOB?


I'm looking for ideas on how best to find duplicate entries in the database for records that may differ only in Date of Birth (DOB). It would be great to find all duplicate entries that share the same Full Name and whose DOB is within a year of one another?

Example:

Tom Johnson | 1990-12-01
Tom Johnson | 1991-12-01
Ted Janson  | 1992-01-01
Tom Johnson | 2000-02-02
Bob Burke   | 2002-06-12

I would like to identify the first two rows as potential duplicates so we can further investigate, given their DOB is so similar and their full name matches. We are assuming that some subset of entries in our data could be duplicates based on common typos and mistakes end users make.

What's the best way to identify and group these records?

Edit: I am not looking for other ways to find duplicates, I am inquiring as to whether or not there's a straightforward way to accomplish identifying records with similar birthdates as mentioned above.

It's very common to use something like this when looking for identical matches:

SELECT Name, DOB
FROM table
GROUP BY Name, DOB
HAVING COUNT(*) > 1

But this solution is not what I am looking for here. We already employ this technique.

Thanks!


Solution

  • find duplicate entries in the database for records that may differ only in Date of Birth (DOB)

    Here is one way to do it with a self-join:

    select t1.name name1, t1.dob dob1, t2.name name2, t2.dob dbo2
    from mytable t1
    inner join mytable t2 
        on  t2.name = t1.name 
        and t2.dob  > t1.dob
        and t2.dob  <= date_add(t1.dob, interval 1 year)     
        
    

    This puts on the same row tuples of users that have the same name and a DOB within one year of each other.

    You could also use window functions with a range frame, which might be more efficient. This brings all rows for which another row exist with the same name within the last or next year (actually, 365 days):

    select *
    from (
        select t.*, 
            count(*) over(
                partition by name
                order by unix_date(dob)
                range between 365 preceding and 365 following
            ) cnt
        from mytable t
    ) t
    where cnt > 1