Tags: sql, stored-procedures, sql-server-2012, deduplication

Deduplication of imported records in SQL Server


I have the following T-SQL stored procedure, which currently takes up 50% of the total time needed to run all processes on newly imported records in our backend analysis suite. Unfortunately, this data needs to be imported every time, and the procedure is becoming a bottleneck as our DB size grows.

Basically, we are trying to identify all duplicates among the records and keep only one of each.

DECLARE @status INT
SET @status = 3

DECLARE @contactid INT
DECLARE @email VARCHAR(100)

-- Contacts: one cursor row per email that occurs more than once for this reference
-- (@reference is not declared in this excerpt; presumably a parameter of the procedure)
DECLARE email_cursor CURSOR FOR
SELECT email FROM contacts WHERE (reference = @reference AND status = 1) GROUP BY email HAVING (COUNT(email) > 1)
OPEN email_cursor

FETCH NEXT FROM email_cursor INTO @email

WHILE @@FETCH_STATUS = 0
    BEGIN
        PRINT @email
        -- Flag every active row for this email as a duplicate
        UPDATE contacts SET duplicate = 1, status = @status WHERE email = @email AND reference = @reference AND status = 1
        -- Pick one of the flagged rows (arbitrary, since there is no ORDER BY) and restore it
        SELECT TOP 1 @contactid = id FROM contacts WHERE reference = @reference AND email = @email AND duplicate = 1
        UPDATE contacts SET duplicate = 0, status = 1 WHERE id = @contactid
        FETCH NEXT FROM email_cursor INTO @email
    END

CLOSE email_cursor
DEALLOCATE email_cursor

I have added all the indexes suggested by the query execution plans, but it may be possible to rewrite the entire SP to run differently, as I have managed to do with others.


Solution

  • Use this single query to de-dup.

    ;with tmp as (
    select *
          ,rn=row_number() over (partition by email, reference order by id)
          ,c=count(1) over (partition by email, reference)
      from contacts
     where status = 1
    )
    update tmp
       set duplicate = case when rn=1 then 0 else 1 end
          ,status = case when rn=1 then 1 else 3 end
     where c > 1
    ;
    

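    This de-dups only among rows where status = 1 and treats rows with the same (email, reference) combination as duplicates. Within each group, the row with the lowest id is kept (rn = 1); every other row is marked duplicate = 1 with status = 3. Unlike the cursor version, it covers every reference in one pass rather than a single @reference.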
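
If the de-dup should stay limited to one import, as the cursor version is through @reference, the same ROW_NUMBER technique can be scoped to that reference. The following is a minimal sketch, not a drop-in replacement. The temp table #contacts is a hypothetical stand-in for the real contacts table, and its column types, the sample rows, and the value 10 for @reference are assumptions made for illustration. It also updates only the rows with rn > 1, which assumes the kept rows already have duplicate = 0.

```sql
-- Hypothetical stand-in for the real contacts table; column types are assumptions.
CREATE TABLE #contacts (
    id        INT IDENTITY(1,1) PRIMARY KEY,
    email     VARCHAR(100) NOT NULL,
    reference INT NOT NULL,
    status    INT NOT NULL,
    duplicate BIT NOT NULL DEFAULT 0
);

-- Sample data: reference 10 has one duplicated email, reference 20 has another.
INSERT INTO #contacts (email, reference, status) VALUES
    ('a@example.com', 10, 1),
    ('a@example.com', 10, 1),
    ('b@example.com', 10, 1),
    ('a@example.com', 20, 1),
    ('a@example.com', 20, 1);

DECLARE @reference INT = 10;  -- assumed value: the import to de-dup
DECLARE @status    INT = 3;   -- status assigned to duplicates, as in the question

-- Number the active rows of this reference per email, lowest id first.
WITH tmp AS (
    SELECT duplicate,
           status,
           rn = ROW_NUMBER() OVER (PARTITION BY email ORDER BY id)
      FROM #contacts
     WHERE reference = @reference
       AND status = 1
)
-- Only the second and later rows of each email are written;
-- the first row (rn = 1) is left untouched.
UPDATE tmp
   SET duplicate = 1,
       status    = @status
 WHERE rn > 1;

-- Inspect the result: only id 2 should now have status 3 and duplicate 1;
-- the rows for reference 20 are not touched.
SELECT id, email, reference, status, duplicate
  FROM #contacts
 ORDER BY id;

DROP TABLE #contacts;
```

Writing only the rn > 1 rows keeps the number of modified rows as small as possible, which may matter on a large table. If kept rows can carry a stale duplicate = 1, the CASE form from the answer above is the safer choice.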