SQL: Detect duplicate customers


I'm trying to write a SQL query that detects (possible) duplicate customers in my database.

I have two tables:

  1. Customer with the columns cid, firstname, lastname, zip. Note that cid is the unique customer id and the primary key of this table.
  2. IgnoreForDuplicateCustomer with the columns cid1, cid2. Both columns are foreign keys that reference Customer(cid). A row in this table states that the customer with cid1 is not the same as the customer with cid2.

So for example, if I have

  • a Customer entry with cid = 1, firstname = "foo", lastname = "anonymous" and zip = "11231"
  • and another Customer entry with cid = 2, firstname = "foo", lastname = "anonymous" and zip = "11231",

then my SQL query should search for customers that have the same firstname, lastname and zip, and detect that the customer with cid = 1 is the same as the customer with cid = 2.

However, it should be possible to declare that the customers with cid = 1 and cid = 2 are not the same by storing a new entry in the IgnoreForDuplicateCustomer table with cid1 = 1 and cid2 = 2.

Detecting the duplicate customers works well with this SQL query:

    SELECT cid, firstname, lastname, zip, COUNT(*) AS NumOccurrences
    FROM Customer
    GROUP BY firstname, lastname, zip
    HAVING COUNT(*) > 1
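
One caveat with the query above: cid is neither grouped nor aggregated, so the value MySQL returns for it is an arbitrary member of each group, and the query is rejected when the ONLY_FULL_GROUP_BY SQL mode is enabled. A minimal sketch that lists every id in each group instead, assuming MySQL's GROUP_CONCAT is available (the cids alias is only illustrative):

```sql
SELECT firstname, lastname, zip,
       GROUP_CONCAT(cid ORDER BY cid) AS cids,  -- all ids sharing this name and zip
       COUNT(*) AS NumOccurrences
FROM Customer
GROUP BY firstname, lastname, zip
HAVING COUNT(*) > 1;
```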

My problem is that I am not able to integrate the IgnoreForDuplicateCustomer table so that, as in my previous example, the customers with cid = 1 and cid = 2 are not reported as the same, since there is an entry/rule for them in the IgnoreForDuplicateCustomer table.

So I tried to extend my previous query by adding a WHERE clause:

    SELECT cid, firstname, lastname, COUNT(*) AS NumOccurrences
    FROM Customer
    WHERE cid NOT IN (
        SELECT cid1 FROM IgnoreForDuplicateCustomer WHERE cid2 = cid
        UNION
        SELECT cid2 FROM IgnoreForDuplicateCustomer WHERE cid1 = cid
    )
    GROUP BY firstname, lastname, zip
    HAVING COUNT(*) > 1

Unfortunately this additional WHERE clause has absolutely no impact on my result. Any suggestions?


Solution

  • Here you are:

    SELECT a.*
    FROM (
      -- every pair of customers sharing name and zip, smaller id first
      SELECT c1.cid AS CID1, c2.cid AS CID2
      FROM customer c1
      JOIN customer c2
        ON c1.firstname = c2.firstname
       AND c1.lastname = c2.lastname
       AND c1.zip = c2.zip
       AND c1.cid < c2.cid
    ) a
    LEFT JOIN (
      -- ignore rules in both orientations
      SELECT cid1 AS CID1, cid2 AS CID2
      FROM ignoreforduplicatecustomer
      UNION
      SELECT cid2 AS CID1, cid1 AS CID2
      FROM ignoreforduplicatecustomer
    ) b ON a.CID1 = b.CID1 AND a.CID2 = b.CID2
    WHERE b.CID1 IS NULL
    

    This returns the ID pairs of duplicate records in the customer table that are not listed in the ignoreforduplicatecustomer table.
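
    The WHERE clause attempted in the question has no effect because the unqualified cid inside the subquery refers to the outer Customer row. Each customer is therefore only compared against the ids of its own ignore partners, a list that does not contain the customer itself (unless a rule pairs a customer with itself). The exclusion has to be applied per pair instead. As an alternative to the LEFT JOIN above, the same idea can be sketched with NOT EXISTS, assuming the same table and column names (the alias i is only illustrative):

    ```sql
    SELECT c1.cid AS CID1, c2.cid AS CID2
    FROM customer c1
    JOIN customer c2
      ON c1.firstname = c2.firstname
     AND c1.lastname = c2.lastname
     AND c1.zip = c2.zip
     AND c1.cid < c2.cid                -- report each pair once, smaller id first
    WHERE NOT EXISTS (
      SELECT 1
      FROM ignoreforduplicatecustomer i
      WHERE (i.cid1 = c1.cid AND i.cid2 = c2.cid)  -- rule stored as (smaller, larger)
         OR (i.cid1 = c2.cid AND i.cid2 = c1.cid)  -- rule stored as (larger, smaller)
    );
    ```

    Against the test setup shown below, this should return the same two pairs, (1, 3) and (2, 3).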

    Tested with:

    CREATE TABLE IF NOT EXISTS `customer` (
     `CID` int(11) NOT NULL AUTO_INCREMENT,
     `Firstname` varchar(50) NOT NULL,
     `Lastname` varchar(50) NOT NULL,
     `ZIP` varchar(10) NOT NULL,
     PRIMARY KEY (`CID`)) 
    ENGINE=InnoDB  DEFAULT CHARSET=latin1 AUTO_INCREMENT=100 ;
    
    INSERT INTO `customer` (`CID`, `Firstname`, `Lastname`, `ZIP`) VALUES
    (1, 'John', 'Smith', '1234'),
    (2, 'John', 'Smith', '1234'),
    (3, 'John', 'Smith', '1234'),
    (4, 'Jane', 'Doe', '1234');
    

    And:

    CREATE TABLE IF NOT EXISTS `ignoreforduplicatecustomer` (
     `CID1` int(11) NOT NULL,
     `CID2` int(11) NOT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
    
    
    INSERT INTO `ignoreforduplicatecustomer` (`CID1`, `CID2`) VALUES
    (1, 2);
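
    The test table above has no keys. Since the question describes cid1 and cid2 as foreign keys referencing Customer(cid), one possible way to add them to this test table is sketched here. The constraint names are hypothetical, and this step was not part of the tested setup:

    ```sql
    ALTER TABLE ignoreforduplicatecustomer
      ADD PRIMARY KEY (CID1, CID2),  -- the same rule cannot be stored twice in the same order
      ADD CONSTRAINT fk_ignore_cid1 FOREIGN KEY (CID1) REFERENCES customer (CID),
      ADD CONSTRAINT fk_ignore_cid2 FOREIGN KEY (CID2) REFERENCES customer (CID);
    ```

    The composite primary key does not stop a rule from being stored in the reversed order, which is why the query still checks both orientations.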
    

    Results for my test setup are:

    CID1  CID2
     1     3
     2     3