sqlsql-servert-sqlduplicates

How to delete duplicate rows that have same data except one column


I am trying to delete duplicate rows but leaving recent updated rows.

Here is illustration of my dataset:

enter image description here

I am trying to delete two rows that are older data (in this case 11/19's - highlighted in yellow).

This is dataset in text:

employee ID punch_start punch_end punch_hours date_load
John Doe 276567 Sep 30 2023 2:50PM Oct 1 2023 6:00AM 15.16666667 11/19/23 2:45 PM
Jane Doe 140037 Sep 30 2023 10:00PM Oct 1 2023 7:05AM 9.083333333 11/19/23 2:45 PM
John Doe 276567 Sep 30 2023 2:50PM Oct 1 2023 6:00AM 15.16666667 11/20/23 2:45 PM
Jane Doe 140037 Sep 30 2023 10:00PM Oct 1 2023 7:05AM 9.083333333 11/20/23 2:45 PM

I tried with this code (based one this post):

with todelete as 
(
    select  
      ,[employee]
      ,[ID]
      ,[punch_start]
      ,[punch_end]
      ,[punch_hours]
      ,row_number() over

      (
       partition by 
       [employee]
      ,[ID]
      ,[punch_start]
      ,[punch_end]
      ,[punch_hours]
      order by [date_load] desc) as seqnum
      from  [dbo].[dataset]
     )
    select * from todelete where seqnum > 1;
    delete from todelete where seqnum > 1;

But the result (on Select *) is:

enter image description here

How do we modify the code to distinguish/only select rows that have older [date_load]?


Solution

  • Maybe the DELETE in SQL Server has enough syntax to support this, but I generally do MERGE here.

    https://dbfiddle.uk/_5PR_82q

    merge into [dbo].[dataset] as tgt using (
        select *  
          from (
                  select  
                      ,[employee]
                      ,[ID]
                      ,[punch_start]
                      ,[punch_end]
                      ,[punch_hours]
                      ,[date_load]
                      ,row_number() over
    
                      (
                       partition by 
                       [employee]
                      ,[ID]
                      ,[punch_start]
                      ,[punch_end]
                      ,[punch_hours]
                      order by [date_load] desc) as seqnum
                  from  [dbo].[dataset]
                ) ranked
         where seqnum > 1
         ) as mrg
        ON (
           tgt.employee = mrg.employee and
           tgt.id = mrg.id and
           tgt.punch_start = mrg.punch_start and
           tgt.punch_end = mrg.punch_end and
           tgt.punch_hours = mrg.punch_hours and 
           tgt.date_load = mrg.date_load
           )
        WHEN MATCHED THEN DELETE;
    

    Also, maybe SQL Server does some magic that I'm not aware of. But in your original question

    delete from todelete where seqnum > 1;
    

    Probably will just fail for syntax. Because todelete is just a cte defined in a totally separate query.