I am trying to write an SQL query to access a Microsoft SQL table. What I am hoping to accomplish is that I can find all rows that have duplicates and delete duplicates only if they are older than 30 days. Here is an example table:
INSERT INTO [dbo].[test]
(id1, id2, firstName, lastName, dayTime)
VALUES
(12, 13, 'Syed','Abbas','05-02-2023'),
(12, 13, 'Syed','Abbas','07-02-2023'),
(12, 14, 'Adam', 'Johnson', '07-02-2023'),
(10, 9, 'Monique', 'Brown', '03-03-2023')
And this is what I have written for my query:
DELETE T
FROM
(
SELECT *
, DupRank = ROW_NUMBER() OVER (
PARTITION BY id1, id2
ORDER BY (SELECT NULL)
)
FROM [dbo].[test]
) AS T
WHERE DupRank > 1 and dayTime < DATEADD(day, -30, GETDATE())
The outcome I am trying to get is that only row 1 (12, 13, Syed, Abbas, 05-02-2023) will be deleted and the rest of the values will stay. However, when I run this query, it does not delete anything-- no errors, just 0 rows affected.
I have tried the separate parts of the query and they work fine (ie, when I just delete duplicates, it removes row 2, and when I just delete for older than 30 days, it removes rows 1 and 4). I am not sure if I am using the "and" clause incorrectly?
I'm guessing (although it's hard to say without seeing the query plan) that the non-deterministic ORDER BY
is causing problems.
When you write ORDER BY (SELECT NULL)
that means that the server is free to calculate the row-number in any order. So it could be that the older row is being numbered 1 and the newer row 2. Then when you filter to DupRank > 1 and dayTime < DATEADD(day, -30, GETDATE())
you are filtering out both rows.
So just use a deterministic numbering. The logical thing to do here would be to number from newest to oldest, so that you always keep the newest row and any others which are less than 30 days old.
DELETE T
FROM
(
SELECT *,
DupRank = ROW_NUMBER() OVER (
PARTITION BY T.id1, T.id2
ORDER BY T.dayTime DESC)
FROM dbo.test T
) AS T
WHERE T.DupRank > 1
AND T.dayTime < DATEADD(day, -30, GETDATE());