Search code examples
sqlsql-serversql-server-2012

Reallocate duplicate record references and remove redundant records


I have duplicated data in Table 1. All information in the fields listed below must be identical to be considered a duplicate. The table has other columns but these are inconsequential. The unique identifier is PDS_Item_Ref.

PDS_Item_Ref ItemRef Age Life Condition Comments HS Picturefile Immediate DETAILS_REF
1830 32976 5 26 Average No access 0 NULL 0 16
1872 32976 5 26 Average No access 0 NULL 0 16
1900 32976 5 26 Average No access 0 NULL 0 16

These records are referenced to other data in Table 2 using the PDS_Item_Ref column. Some or all of the various records may be referenced.

Collection_Ref PDS_Item_Ref Quantity
1000 1830 3
1001 1872 5
1002 1900 6
1003 1830 6

I need to update the PDS_Item_Ref field in Table 2 to select the duplicated item with the lowest [PDS_Item_Ref] and then delete the remainder of the now redundant records in Table 1. In the example above, item 1830 would be retained and all others would be removed.


Solution

  • This can be done in a lot of ways, I prefer to use a three-step model:

    1. Get duplicates into a temporary table
    2. Update referenced tables
    3. Remove duplicates from main table
    DROP TABLE #maintable
    DROP TABLE #dupes
    DROP TABLE #updatetable
    SELECT  *
    INTO #maintable
    FROM    (
            VALUES  (1830, 32976, 5, 26, N'Average', N'No access', 0, N'NULL', 0, 16)
            ,   (1872, 32976, 5, 26, N'Average', N'No access', 0, N'NULL', 0, 16)
            ,   (1900, 32976, 5, 26, N'Average', N'No access', 0, N'NULL', 0, 16)
            ,   (1901, 32976, 5, 26, N'Average', N'Has access', 0, N'NULL', 0, 16)
        ) t (PDS_Item_Ref,ItemRef,Age,Life,Condition,Comments,HS,Picturefile,Immediate,DETAILS_REF)
        
    SELECT  PDS_item_ref, min_ref
    INTO #dupes
    FROM    (
        SELECT  *
        ,   MIN(PDS_item_ref) OVER(partition BY ItemRef,Age,Life,Condition,Comments,HS,Picturefile,Immediate,DETAILS_REF ORDER BY pds_item_ref) AS min_ref
        FROM    #maintable
    ) x
    WHERE   x.min_ref <> PDS_Item_Ref
    
    -- create index...
    CREATE UNIQUE CLUSTERED INDEX IX_#dupes ON  #dupes (PDS_item_ref)
    
    SELECT  *
    INTO #updatetable
    FROM    (
    
        VALUES  (1000, 1830, 3)
        ,   (1001, 1872, 5)
        ,   (1002, 1900, 6)
        ,   (1003, 1830, 6)
    ) t (Collection_Ref,PDS_Item_Ref,Quantity)
    
    
    -- update dups...
    SET xact_abort ON;
    
    BEGIN TRAN
        
        UPDATE  t
        SET PDS_Item_Ref = d.min_ref
        FROM    #updatetable t
        INNER JOIN #dupes d
            ON  d.PDS_item_ref = t.PDS_item_ref
        
        -- update others...
        
        -- Finally delete
        
        DELETE  t
        FROM    #maintable t
        INNER JOIN #dupes d
            ON  d.PDS_item_ref = t.PDS_Item_Ref
    
        SELECT  *
        FROM    #updatetable
        SELECT  *
        FROM    #maintable
    COMMIT TRAN;
    

    First I create some test data and then get all duplicate rows. MIN(PDS_item_ref) OVER(partition BY ItemRef,Age,Life,Condition,Comments,HS,Picturefile,Immediate,DETAILS_REF ORDER BY pds_item_ref) is a window aggregate designed for two things, to find groups by all the columns and get the minimum ID for each group.

    WHERE x.min_ref <> PDS_Item_Ref this ensures we only get the duplicate rows of the groups

    Finally we can update the related tables and delete from the main one. It's a good idea to do things in transaction so you don't end up with partially updated data in case something goes wrong.

    And most importantly, create a backup before starting working on this.