Search code examples
sqldatabasemergeduplicatesrecords

Merging duplicate records in SQL where 'User' is duplicate but the PK and some columns are unique


I have 2 tables 'tbMemeberInfo' as well as 'tbDocument'.

When a Document is uploaded it records a field InfoID in tbDocument, which is the PK(MemberInfoID) in tbMemeberInfo.

There are however duplicates in tbMemeberInfo. Users have an 'AgreementNo' and an 'IDNumber' and the duplicate records only contain 1 or the other.

I need to merge these records to insert the 'AgreementNo' and 'IDNumber' into 1 of the records and remove the other.

Here are the table structures and code I have tried so far...

tbMemberInfo

MemberInfoID   AgreementNo  IDNumber     DOB                  Initials  FirstName   LastName   Language      Role    CountryID     Email
861616         124346665    NULL         1976-08-24 00:00:00.000    DV   DAMIAN     Example     English      Member         1     damian.example@mail.com
866185          NULL      7608241234123  1976-08-24 00:00:00.000    DV   DAMIAN     Example     English      Member         1     damian.example@mail.com

tbDocument

DocumentID  r_object_id      DocumentTypeID   UniqueDocumentNo  ContentLength   ContentType        FileName                                      CreatedUserID     CreatedDate                 InfoID   
293787      0900d431800bc987    13            PPS156329L         1753819        application/pdf    Example_DV_PROV_APP_2009110316140300[1].pdf     362            2010-01-13 16:21:46.250       861616
293794      0900d431800bc998    530           PPS156335O         66750          image/tiff         Example, DV DRS REPORT.tif                      362            2010-01-13 16:26:48.420       861616

SQL Code

DECLARE
@MemberInfoID int
,@AgreementNo varchar(50)
,@IDNumber varchar
,@DOB datetime
,@TitleID int
,@FirstName varchar(150)
,@LastName varchar(150)
,@ModifiedDate datetime 

SELECT @AgreementNo = AgreementNo, @IDNumber = IDNumber, @FirstName = FirstName, @LastName = LastName, @DOB = DOB
FROM tbMemberInfo mi
INNER JOIN tbDocument d
ON mi.MemberInfoID = d.InfoID
WHERE (mi.AgreementNo = '') OR (mi.IDNumber = '') 

--SELECT @IDNumber = IDNumber  From tbMemberInfo mi
--INNER JOIN tbDocument d
--ON mi.MemberInfoID = d.InfoID
--WHERE (mi.AgreementNo = '') 

--SELECT @AgreementNo = AgreementNo From tbMemberInfo mi
--INNER JOIN tbDocument d
--ON mi.MemberInfoID = d.InfoID
--WHERE (mi.IDNumber  = '') AND (FirstName = @FirstName) AND (LastName = @LastName) AND (DOB = @DOB)

UPDATE tbMemberInfo
SET [IDNumber] = @IDNumber, [AgreementNo] = @AgreementNo, ModifiedDate  = GETDATE() 
FROM tbMemberInfo mi
    INNER JOIN tbDocument d
        ON mi.MemberInfoID = d.InfoID   
 WHERE (IDNumber = '') OR (AgreementNo = '') AND (FirstName = @FirstName)
        AND (LastName = @LastName) AND (DOB = @DOB)
 GROUP BY MemberInfoID

None of this is working. It puts a '7' into all the IDNumber columns. Any ideas on how to do this. I havn't got round to deleting the duplicates, I want to merge first.


Solution

  • You should use IDNumber is NULL instead of = ''

    Your where clause has some problems, you should write something like:

    WHERE (IDNumber is Null or AgreementNo is NULL) AND (FirstName = @FirstName AND LastName = @LastName AND DOB = @DOB)
    

    my suggestion would be to create a tmp table with the exact same schema as tbMemberInfo (this will help you organize your steps and simplify the task a little bit); group entries in the way you want and merge them into the tmp table (if I understand correctly you are using FirstName, LastName and DateOfBirth to identify distinct entries). And at the end truncate everything in tbMemberInfo and fill it up with what you have in tmp table.. Here is a little more detail:

    INSERT INTO tmp
    SELECT * FROM tbMemberInfo;
    
    UPDATE tmp t1 INNER JOIN tbMemberInfo t2
            ON    t1.FirstName = t2.FirstName 
            AND   t1.LastName = t2.LastName 
            AND   t1.DOB = t2.DOB 
            SET   t1.AgreementNo = t2.AgreementNo 
            WHERE t1.AgreementNo IS NULL 
            AND   t1.MemberInfoID != t2.MemberInfoID;
    
    UPDATE tmp t1 INNER JOIN tbMemberInfo t2
            ON    t1.FirstName = t2.FirstName 
            AND   t1.LastName = t2.LastName 
            AND   t1.DOB = t2.DOB 
            SET   t1.IDNumber = t2.IDNumber 
            WHERE t1.IDNumber is NULL 
            AND   t1.MemberInfoID != t2.MemberInfoID;
    
    -- Just to make sure there are no entries left without a value
    SELECT * FROM tmp WHERE IDNumber is NULL OR AgreementNo is NULL;
    
    -- now we are going to keep the row with the lowest ID
    DELETE t1 FROM tmp t1, tmp t2 
        WHERE t1.MemberInfoID < t2.MemberInfoID 
        AND   t1.FirstName = t2.FirstName 
        AND   t1.LastName = t2.LastName 
        AND   t1.DOB = t2.DOB;
    
    -- update the InfoID intbDocument table
    UPDATE tbDocument t1 JOIN tbMemberInfo t2 
               ON    t1.InfoID = t2.MemberInfoID
               JOIN  tmp t2 
               ON    t2.FirstName = t3.FirstName 
               AND   t2.LastName = t3.LastName 
               AND   t2.DOB = t3.DOB
                   SET   t1.InfoID = t3.tbMemberInfoID
                   WHERE t1.InfoID NOT IN (SELECT MemberInfoID FROM tmp);
    
    TRUNCATE tbMemberInfo;
    
    INSERT INTO tbMemberInfo
    SELECT * FROM tmp;
    

    NOTE: I did not test these queries and just wrote them very quickly, so they are probably buggy, but you'll get the idea and can fix them yourself. Don't run these on your original data, make a duplicate and test them first.