Search code examples
sqlsql-servert-sqlduplicatessql-delete

How to remove rest of the rows with the same ID starting from the first duplicate?


I have the following structure for the table DataTable: every column is of the datatype int, RowID is an identity column and the primary key. LinkID is a foreign key and links to rows of an other table.

RowID   LinkID   Order  Data    DataSpecifier
1       120      1      1       1
2       120      2      1       3
3       120      3      1       10
4       120      4      1       13
5       120      5      1       10
6       120      6      1       13
7       371      1      6       2
8       371      2      3       5
9       371      3      8       1
10      371      4      10      1
11      371      5      7       2
12      371      6      3       3
13      371      7      7       2
14      371      8      17      4
.................................
.................................

I'm trying to do a query which alters every LinkID batch in the following way:

  • Take every row with same LinkID (e.g. the first batch is the first 6 rows here)
  • Order them by the Order column
  • Look at Data and DataSpecifier columns as one compare unit (They can be thought as one column, called dataunit):
    • Keep as many rows from Order 1 onwards, until a duplicate dataunit comes by
    • Delete every row from that first duplicate onwards for that LinkID

So for the LinkID 120:

  • Sort the batch (already sorted here, but should still do it)
  • Start looking from the top (So Order=1 here), go as long as you don't see a duplicate.
  • Stop at the first duplicate Order = 5 (dataunit 1 10 was already seen).
  • Delete everything which has the LinkID=120 AND Order>=5

After similar process for LinkID 371 (and every other LinkID in the table), the processed table will look like this:

RowID   LinkID   Order  Data    DataSpecifier
1       120      1      1       1
2       120      2      1       3
3       120      3      1       10
4       120      4      1       13
7       371      1      6       2
8       371      2      3       5
9       371      3      8       1
10      371      4      10      1
11      371      5      7       2
12      371      6      3       3
.................................
.................................

I've done quite a lot of SQL queries, but never something this complicated. I know I need to use a query which is something like this:

DELETE FROM DataTable  
WHERE RowID IN (SELECT RowID
                FROM DataTable
                WHERE -- ?
                GROUP BY LinkID
                HAVING COUNT(*) > 1 -- ?
                ORDER BY [Order]);

But I just can't seem to wrap my head around this and get the query right. I would preferably do this in pure SQL, with one executable (and reusable) query.


Solution

  • You can use the ROW_NUMBER() window function to identify any rows that come after the original. After that you can delete and rows with a matching LinkID and a greater than or equal to any encountered Order with a row number greater than one.

    (I originally used a second CTE to get the MIN order, but I realized that it wasn't necessary as long as the join to order was greater than equal to any order where there was a second instance of the DataUnitId. By removing the MIN the query plan became quite simple and efficient.)

    WITH DataUnitInstances AS (
      SELECT *
        , ROW_NUMBER() OVER
          (PARTITION BY LinkID, [Data], [DataSpecifier] ORDER BY [Order]) DataUnitInstanceId
      FROM DataTable
    )
    DELETE FROM DataTable
    FROM DataTable dt
    INNER JOIN DataUnitInstances dup ON dup.LinkID = dt.LinkID 
      AND dup.[Order] <= dt.[Order]
      AND dup.DataUnitInstanceId > 1
    

    Here is the output from your sample data which matches your desired result:

    +-------+--------+-------+------+---------------+
    | RowID | LinkID | Order | Data | DataSpecifier |
    +-------+--------+-------+------+---------------+
    | 1     | 120    | 1     | 1    | 1             |
    | 2     | 120    | 2     | 1    | 3             |
    | 3     | 120    | 3     | 1    | 10            |
    | 4     | 120    | 4     | 1    | 13            |
    | 7     | 371    | 1     | 6    | 2             |
    | 8     | 371    | 2     | 3    | 5             |
    | 9     | 371    | 3     | 8    | 1             |
    | 10    | 371    | 4     | 10   | 1             |
    | 11    | 371    | 5     | 7    | 2             |
    | 12    | 371    | 6     | 3    | 3             |
    +-------+--------+-------+------+---------------+