im trying to create a sql query, that will detect (possible) duplicate customers in my database:
I have two tables:
So for example, if i have
So my sql query should search for customers, that have the same firstname, lastname and zip and the detect that customer with cid = 1 is the same as customer with cid = 2.
However, it should be possible to say, that customer cid = 1 and cid=2 are not the same, by storing a new entry in the IgnoreForDuplicateCustomer table by setting cid1 = 1 and cid2 = 2.
So detecting the duplicate customers work well with this sql query script:
SELECT cid, firstname, lastname, zip, COUNT(*) AS NumOccurrences
FROM Customer
GROUP BY fistname, lastname,zip
HAVING ( COUNT(*) > 1 )
My problem is, that i am not able, to integrate the IgnoreForDuplicateCustomer table, to that like in my previous example the customer with cid = 1 and cid=2 will not be marked / queried as the same, since there is an entry/rule in the IgnoreForDuplicateCustomer table.
So i tried to extend my previous query by adding a where clause:
SELECT cid, firstname, lastname, COUNT(*) AS NumOccurrences
FROM Customer
WHERE cid NOT IN (
SELECT cid1 FROM IgnoreForDuplicateCustomer WHERE cid2=cid
UNION
SELECT cid2 FROM IgnoreForDuplicateCustomer WHERE cid1=cid
)
GROUP BY firstname, lastname, zip
HAVING ( COUNT(*) > 1 )
Unfortunately this additional WHERE clause has absolutely no impact on my result. Any suggestions?
Here you are:
Select a.*
From (
select c1.cid 'CID1', c2.cid 'CID2'
from Customer c1
join Customer c2 on c1.firstname=c2.firstname
and c1.lastname=c2.lastname and c1.zip=c2.zip
and c1.cid < c2.cid) a
Left Join (
Select cid1 'CID1', cid2 'CID2'
From ignoreforduplicatecustomer one
Union
Select cid2 'CID1', cid1 'CID2'
From ignoreforduplicatecustomer two) b on a.cid1 = b.cid1 and a.cid2 = b.cid2
where b.cid1 is null
This will get you the IDs of duplicate records from customer
table, which are not in table ignoreforduplicatecustomer
.
Tested with:
CREATE TABLE IF NOT EXISTS `customer` (
`CID` int(11) NOT NULL AUTO_INCREMENT,
`Firstname` varchar(50) NOT NULL,
`Lastname` varchar(50) NOT NULL,
`ZIP` varchar(10) NOT NULL,
PRIMARY KEY (`CID`))
ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=100 ;
INSERT INTO `customer` (`CID`, `Firstname`, `Lastname`, `ZIP`) VALUES
(1, 'John', 'Smith', '1234'),
(2, 'John', 'Smith', '1234'),
(3, 'John', 'Smith', '1234'),
(4, 'Jane', 'Doe', '1234');
And:
CREATE TABLE IF NOT EXISTS `ignoreforduplicatecustomer` (
`CID1` int(11) NOT NULL,
`CID2` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `ignoreforduplicatecustomer` (`CID1`, `CID2`) VALUES
(1, 2);
Results for my test setup are:
CID1 CID2
1 3
2 3