I have a data set that has some column where values match, but the rest of the column values do not. I need to delete duplicates where SubCategory of a lower level (Level2, Level3 and Level 4) "IS NOT NULL" but its corresponding "duplicate partner" (grouped by [SubCategory Level 1 ID], [Product Category] and [Product Name]) has the same lower level SubCategory - "IS NULL". Per table below I need to remove ID 2, 4, 6 and 9 (see highlighted in red font).
I've tried Dense_Rank, Rank and Row_Number functions with Partition By but that did not give me the disired output. Maybe I need to use their combination...
Eg.: RowID 1 and 2 are duplicates by [Product Category], [Product Name], [Category Level 1]. "Category Level 1" is just an ID of "Product Category". In need to remove RowID 2 because its corresponding duplicate partner RowID 1 has no "Category Level 3" assigned when RowID 2 has. Same logic applues to RowID 9 and 10, but at this time RowID 9 has "Category Level 2" where Row 10 does not. If both duplicates (RowID 1 and 2) would have "Category Level 3" assigned we would not need to delete any of them
IF OBJECT_ID('tempdb..#Category', 'U') IS NOT NULL
DROP TABLE #Category;
GO
CREATE TABLE #Category
(
RowID INT NOT NULL,
CategoryID INT NOT NULL,
ProductCategory VARCHAR(100) NOT NULL,
ProductName VARCHAR(100) NOT NULL,
[SubCategory Level 1 ID] INT NOT NULL,
[SubCategory Level 2 ID] INT NULL,
[SubCategory Level 3 ID] INT NULL,
[SubCategory Level 4 ID] INT NULL
);
INSERT INTO #Category (RowID, CategoryID, ProductCategory, ProductName, [SubCategory Level 1 ID], [SubCategory Level 2 ID], [SubCategory Level 3 ID], [SubCategory Level 4 ID])
VALUES
(1, 111, 'Furniture', 'Table', 200, 111, NULL, NULL),
(2, 234, 'Furniture', 'Table', 200, 234, 123, NULL),
(3, 122, 'Furniture', 'Chair', 200, 122, NULL, NULL),
(4, 122, 'Furniture', 'Chair', 200, 122, 32, NULL),
(5, 12, 'Auto', 'Trucks', 300, 766, 12, NULL),
(6, 3434, 'Auto', 'Trucks', 300, 322, 3434, 333),
(7, 332, 'Auto', 'Sport Vehicles', 300, 332, NULL, NULL),
(8, 332, 'Auto', 'Sport Vehicles', 300, 332, NULL, NULL),
(9, 300, 'Auto', 'Sedans', 300, 231, NULL, NULL),
(10, 300, 'Auto', 'Sedans', 300, NULL, NULL, NULL),
(11, 300, 'Auto', 'Cabriolet', 300, 456, 688, NULL),
(12, 300, 'Auto', 'Cabriolet', 300, 456, 976, NULL),
(13, 300, 'Auto', 'Motorcycles', 300, 456, 235, 334),
(14, 300, 'Auto', 'Motorcycles', 300, 456, 235, 334);
SELECT * FROM #Category;
-- ADD YOU CODE HERE TO RETURN the following RowIDs: 2, 4, 6, 9
If I understand this right, your logic is the following:
For each unique SubCategory Level 1
, Product Category
, and Product Name
combination, you want to return the row which has the least amount of filled in SubCategory level data.
Using a quick dense_rank
with partitions
on the relevant fields, you can order
the rows with less Sub Categories levels to be set to 1
. Rows 2
, 4
, 6
, and 9
should now be the only rows returned.
;with DataToSelect
as
(
SELECT *,
DENSE_RANK() OVER(PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID]
ORDER BY
CASE
WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
END) as [ToInclude]
FROM #Category
)
SELECT *
FROM
DataToSelect
WHERE
ToInclude != 1
ORDER BY
RowID
Keep in mind if you have two rows with the same SubCategory level per SubCategory Level 1
, Product Category
, and Product Name
combination, they'll both be included. If you do not want this, just swap the dense_rank
to row_number
and add some alternative criteria on which should be selected first.