sql-server group-by duplicates partitioning dense-rank

Remove Partial Duplicate Rows in SQL Server 2016

I have a data set that has some column where values match, but the rest of the column values do not. I need to delete duplicates where SubCategory of a lower level (Level2, Level3 and Level 4) "IS NOT NULL" but its corresponding "duplicate partner" (grouped by [SubCategory Level 1 ID], [Product Category] and [Product Name]) has the same lower level SubCategory - "IS NULL". Per table below I need to remove ID 2, 4, 6 and 9 (see highlighted in red font).

I've tried Dense_Rank, Rank and Row_Number functions with Partition By but that did not give me the disired output. Maybe I need to use their combination...

Eg.: RowID 1 and 2 are duplicates by [Product Category], [Product Name], [Category Level 1]. "Category Level 1" is just an ID of "Product Category". In need to remove RowID 2 because its corresponding duplicate partner RowID 1 has no "Category Level 3" assigned when RowID 2 has. Same logic applues to RowID 9 and 10, but at this time RowID 9 has "Category Level 2" where Row 10 does not. If both duplicates (RowID 1 and 2) would have "Category Level 3" assigned we would not need to delete any of them

IF OBJECT_ID('tempdb..#Category', 'U') IS NOT NULL
  DROP TABLE #Category;
GO

CREATE TABLE #Category 
(
    RowID INT NOT NULL,
    CategoryID INT NOT NULL,
    ProductCategory VARCHAR(100) NOT NULL,
    ProductName VARCHAR(100) NOT NULL,
    [SubCategory Level 1 ID] INT NOT NULL,
    [SubCategory Level 2 ID] INT NULL,
    [SubCategory Level 3 ID] INT NULL,  
    [SubCategory Level 4 ID] INT NULL
);

INSERT INTO #Category (RowID, CategoryID, ProductCategory, ProductName, [SubCategory Level 1 ID], [SubCategory Level 2 ID], [SubCategory Level 3 ID], [SubCategory Level 4 ID])
VALUES 
(1, 111, 'Furniture', 'Table', 200, 111, NULL, NULL),
(2, 234, 'Furniture', 'Table', 200, 234, 123, NULL),
(3, 122, 'Furniture', 'Chair', 200, 122, NULL, NULL),
(4, 122, 'Furniture', 'Chair', 200, 122, 32, NULL),
(5, 12, 'Auto', 'Trucks', 300, 766, 12, NULL),
(6, 3434, 'Auto', 'Trucks', 300, 322, 3434, 333),
(7, 332, 'Auto', 'Sport Vehicles', 300, 332, NULL, NULL),
(8, 332, 'Auto', 'Sport Vehicles', 300, 332, NULL, NULL),
(9, 300, 'Auto', 'Sedans', 300, 231, NULL, NULL),
(10, 300, 'Auto', 'Sedans', 300, NULL, NULL, NULL),
(11, 300, 'Auto', 'Cabriolet', 300, 456, 688, NULL),
(12, 300, 'Auto', 'Cabriolet', 300, 456, 976, NULL),
(13, 300, 'Auto', 'Motorcycles', 300, 456, 235, 334),
(14, 300, 'Auto', 'Motorcycles', 300, 456, 235, 334);


SELECT * FROM #Category; 
-- ADD YOU CODE HERE TO RETURN the following RowIDs: 2, 4, 6, 9

Solution

If I understand this right, your logic is the following:

For each unique SubCategory Level 1, Product Category, and Product Name combination, you want to return the row which has the least amount of filled in SubCategory level data.

Using a quick dense_rank with partitions on the relevant fields, you can order the rows with less Sub Categories levels to be set to 1. Rows 2, 4, 6, and 9 should now be the only rows returned.

;with DataToSelect
as
(
    SELECT *,
           DENSE_RANK() OVER(PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID] 
                    ORDER BY
                        CASE 
                            WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
                            WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
                            WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
                            END) as [ToInclude]
    FROM #Category
)
SELECT *
FROM 
    DataToSelect
WHERE 
    ToInclude != 1
ORDER BY 
    RowID

Keep in mind if you have two rows with the same SubCategory level per SubCategory Level 1, Product Category, and Product Name combination, they'll both be included. If you do not want this, just swap the dense_rank to row_number and add some alternative criteria on which should be selected first.