Search code examples
sql-serverperformancet-sqlduplicatesi2b2

Quicker way of finding duplicates in SQL Server


I'm trying to find a better way of finding duplicates in SQL Server. This took over 20 minutes to run with just over 300 million records before results started showing in the results window within SSMS. Another 22 minutes elapsed before it crashed.

Then SSMS threw this error after displaying 16,777,216 records:

An error occurred while executing batch. Error message is: Exception of type 'System.OutOfMemoryException' was thrown.

Schema:

ENCOUNTER_NUM - numeric(22,0)
CONCEPT_CD - varchar(50)
PROVIDER_ID - varchar(50)
START_DATE - datetime
MODIFIER_CD - varchar(100)
INSTANCE_NUM - numeric(18,0)


SELECT
    ROW_NUMBER() OVER (ORDER BY f1.[ENCOUNTER_NUM],f1.[CONCEPT_CD],f1.[PROVIDER_ID],f1.[START_DATE],f1.[MODIFIER_CD],f1.[INSTANCE_NUM]),
    f1.[ENCOUNTER_NUM], 
    f1.[CONCEPT_CD], 
    f1.[PROVIDER_ID], 
    f1.[START_DATE], 
    f1.[MODIFIER_CD], 
    f1.[INSTANCE_NUM]
FROM
    [dbo].[I2B2_OBSERVATION_FACT] f1
    INNER JOIN [dbo].[I2B2_OBSERVATION_FACT] f2 ON
        f1.[ENCOUNTER_NUM] = f2.[ENCOUNTER_NUM] 
        AND f1.[CONCEPT_CD] = f2.[CONCEPT_CD]
        AND f1.[PROVIDER_ID] = f2.[PROVIDER_ID]
        AND f1.[START_DATE] = f2.[START_DATE]
        AND f1.[MODIFIER_CD] = f2.[MODIFIER_CD]
        AND f1.[INSTANCE_NUM] = f2.[INSTANCE_NUM]

Solution

  • Not sure how much faster this is, but worth a try.

    SELECT
        COUNT(*) AS Dupes,
        f1.[ENCOUNTER_NUM], 
        f1.[CONCEPT_CD], 
        f1.[PROVIDER_ID], 
        f1.[START_DATE], 
        f1.[MODIFIER_CD], 
        f1.[INSTANCE_NUM]
    FROM
        [dbo].[I2B2_OBSERVATION_FACT] f1
    GROUP BY
        f1.[ENCOUNTER_NUM], 
        f1.[CONCEPT_CD], 
        f1.[PROVIDER_ID], 
        f1.[START_DATE], 
        f1.[MODIFIER_CD], 
        f1.[INSTANCE_NUM]
    HAVING
        COUNT(*) > 1