I have duplicated data in Table 1. Rows are considered duplicates when all of the fields listed below are identical. The table has other columns, but these are inconsequential. The unique identifier is PDS_Item_Ref.
PDS_Item_Ref | ItemRef | Age | Life | Condition | Comments | HS | Picturefile | Immediate | DETAILS_REF |
---|---|---|---|---|---|---|---|---|---|
1830 | 32976 | 5 | 26 | Average | No access | 0 | NULL | 0 | 16 |
1872 | 32976 | 5 | 26 | Average | No access | 0 | NULL | 0 | 16 |
1900 | 32976 | 5 | 26 | Average | No access | 0 | NULL | 0 | 16 |
These records are referenced from Table 2 via the PDS_Item_Ref column; some or all of the duplicate records may be referenced.
Collection_Ref | PDS_Item_Ref | Quantity |
---|---|---|
1000 | 1830 | 3 |
1001 | 1872 | 5 |
1002 | 1900 | 6 |
1003 | 1830 | 6 |
I need to update the PDS_Item_Ref field in Table 2 to point at the duplicated item with the lowest [PDS_Item_Ref], and then delete the now-redundant records from Table 1. In the example above, item 1830 would be retained and the others removed.
This can be done in a lot of ways; I prefer a three-step approach: build the list of duplicates, remap the references in Table 2, then delete the redundant rows from Table 1.
```sql
DROP TABLE IF EXISTS #maintable;
DROP TABLE IF EXISTS #dupes;
DROP TABLE IF EXISTS #updatetable;

-- test data for Table 1
SELECT *
INTO #maintable
FROM (
    VALUES (1830, 32976, 5, 26, N'Average', N'No access', 0, CAST(NULL AS nvarchar(255)), 0, 16)
         , (1872, 32976, 5, 26, N'Average', N'No access', 0, NULL, 0, 16)
         , (1900, 32976, 5, 26, N'Average', N'No access', 0, NULL, 0, 16)
         , (1901, 32976, 5, 26, N'Average', N'Has access', 0, NULL, 0, 16)
) t (PDS_Item_Ref, ItemRef, Age, Life, Condition, Comments, HS, Picturefile, Immediate, DETAILS_REF);

-- all duplicate rows, paired with the lowest PDS_Item_Ref of their group
SELECT PDS_Item_Ref, min_ref
INTO #dupes
FROM (
    SELECT *
         , MIN(PDS_Item_Ref) OVER (PARTITION BY ItemRef, Age, Life, Condition, Comments, HS, Picturefile, Immediate, DETAILS_REF) AS min_ref
    FROM #maintable
) x
WHERE x.min_ref <> PDS_Item_Ref;

-- index to speed up the joins below
CREATE UNIQUE CLUSTERED INDEX IX_#dupes ON #dupes (PDS_Item_Ref);

-- test data for Table 2
SELECT *
INTO #updatetable
FROM (
    VALUES (1000, 1830, 3)
         , (1001, 1872, 5)
         , (1002, 1900, 6)
         , (1003, 1830, 6)
) t (Collection_Ref, PDS_Item_Ref, Quantity);

-- remap references to duplicates, then delete the redundant rows
SET XACT_ABORT ON;
BEGIN TRAN;

UPDATE t
SET PDS_Item_Ref = d.min_ref
FROM #updatetable t
INNER JOIN #dupes d
    ON d.PDS_Item_Ref = t.PDS_Item_Ref;

-- update any other referencing tables here...

-- finally delete
DELETE t
FROM #maintable t
INNER JOIN #dupes d
    ON d.PDS_Item_Ref = t.PDS_Item_Ref;

SELECT * FROM #updatetable;
SELECT * FROM #maintable;

COMMIT TRAN;
```
First I create some test data, then collect all the duplicate rows.

MIN(PDS_Item_Ref) OVER (PARTITION BY ItemRef, Age, Life, Condition, Comments, HS, Picturefile, Immediate, DETAILS_REF)

is a window aggregate that does two things at once: it partitions the rows into groups over all the relevant columns and returns the lowest PDS_Item_Ref within each group. Note that PARTITION BY, like GROUP BY, treats NULLs as equal, so rows with NULL in Picturefile still land in the same group.
WHERE x.min_ref <> PDS_Item_Ref

This filter keeps only the non-minimum rows of each group: the keeper (whose min_ref equals its own PDS_Item_Ref) is excluded, so #dupes contains exactly the rows to remap and delete.
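As a side note, the same set of redundant rows can be identified with ROW_NUMBER(): within each group, the row numbered 1 is the keeper and everything above 1 is a duplicate. A sketch against the #maintable test data above:

```sql
-- rn = 1 marks the keeper; rn > 1 marks the redundant rows,
-- i.e. the same rows the min_ref <> PDS_Item_Ref filter catches
SELECT PDS_Item_Ref
     , MIN(PDS_Item_Ref) OVER (PARTITION BY ItemRef, Age, Life, Condition, Comments, HS, Picturefile, Immediate, DETAILS_REF) AS min_ref
     , ROW_NUMBER()      OVER (PARTITION BY ItemRef, Age, Life, Condition, Comments, HS, Picturefile, Immediate, DETAILS_REF
                               ORDER BY PDS_Item_Ref) AS rn
FROM #maintable;
```

Which form you pick is mostly taste; the MIN() version has the advantage that the keeper's ID (min_ref) is already on every row, ready to be used in the UPDATE.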
Finally we update the related tables and delete the redundant rows from the main one. Run against the test data, this leaves rows 1830 and 1901 in #maintable and points every Collection_Ref at 1830. It's a good idea to wrap the whole operation in a transaction so you don't end up with partially updated data if something goes wrong; SET XACT_ABORT ON ensures the transaction is rolled back on any run-time error.
And most importantly, take a backup before you start.
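Inside the transaction, before the COMMIT, a quick sanity check (a sketch using the temp tables above) can confirm that no reference still points at a row scheduled for deletion:

```sql
-- after the UPDATE: any remaining reference to a #dupes row would become dangling
-- once the DELETE runs; the count should be 0 before you commit
SELECT COUNT(*) AS dangling_refs
FROM #updatetable t
INNER JOIN #dupes d
    ON d.PDS_Item_Ref = t.PDS_Item_Ref;
```

If the count is anything other than 0, ROLLBACK TRAN and investigate instead of committing.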