Search code examples
sqldatabaseoracle-databasedrop-duplicates

Two DELETE statements in Oracle to delete duplicates


We have a table with over 55k rows that have an identifying name duplicated. The name can vary and the number of duplicates with each name can vary. So I applied these 2 scripts for practice deleting duplicate records from a table. Is there a difference? Are there any issues with the scripts? The output seemed the same.

DELETE FROM RDSUSER.A_JOB 
WHERE ROWID IN (SELECT ROWID 
                FROM (SELECT ROWID
                     , ROW_NUMBER() OVER(PARTITION BY JOB_NAME ORDER BY JOB_NAME) DUP 
                      FROM RDSUSER.A_JOB) 
                WHERE DUP > 1);

DELETE FROM RDSUSER.A_JOB JOB
WHERE ROWID > (SELECT MIN(ROWID) 
               FROM A_JOB
               WHERE JOB.JOB_NAME = A_JOB.JOB_NAME);

Solution

  • Is there a difference?

    Yes.

    DELETE FROM RDSUSER.A_JOB 
    WHERE ROWID IN (SELECT ROWID 
                    FROM (SELECT ROWID
                         , ROW_NUMBER() OVER(PARTITION BY JOB_NAME ORDER BY JOB_NAME) DUP 
                          FROM RDSUSER.A_JOB) 
                    WHERE DUP > 1);
    

    Will PARTITION BY JOB_NAME and then ORDER BY JOB_NAME and since all the JOB_NAME within the partition are going to be identical then the ORDER BY clause is non-deterministic and the rows within the partition will be given an effectively random numbering for the rows and it is not guaranteed which rows in the partition will be kept/deleted.

    This means that if you run the query and then ROLLBACK and run the query again then the set of deleted rows may be different the second time (for example, if you run the query on a parallel system).

    DELETE FROM RDSUSER.A_JOB JOB
    WHERE ROWID > (SELECT MIN(ROWID) 
                   FROM A_JOB
                   WHERE JOB.JOB_NAME = A_JOB.JOB_NAME);
    

    Will always keep the minimum ROWID for each JOB_NAME and the selection of rows to be kept/deleted will be deterministic.

    This means that if you run the query and then ROLLBACK the change and run the delete a second time then an identical set of rows will be deleted.


    If you want the queries to function identically then you can use:

    DELETE FROM RDSUSER.A_JOB 
    WHERE ROWID IN (SELECT ROWID 
                    FROM (
                      SELECT ROW_NUMBER() OVER(PARTITION BY JOB_NAME ORDER BY ROWID)
                               AS DUP 
                      FROM   RDSUSER.A_JOB
                    ) 
                    WHERE DUP > 1);
    

    An example of the random ordering is:

    CREATE TABLE a_job (
      id       NUMBER(5,0) GENERATED ALWAYS AS IDENTITY,
      job_name VARCHAR2(20)
    );
    
    INSERT INTO a_job (job_name)
    SELECT 'a' FROM DUAL CONNECT BY LEVEL <= 3 UNION ALL
    SELECT 'b' FROM DUAL CONNECT BY LEVEL <= 2 UNION ALL
    SELECT 'c' FROM DUAL CONNECT BY LEVEL <= 5;
    
    INSERT INTO a_job (job_name)
    SELECT 'a' FROM DUAL CONNECT BY LEVEL <= 3;
    

    Then after:

    DELETE FROM /*RDSUSER.*/A_JOB 
    WHERE ROWID IN (SELECT ROWID 
                    FROM (SELECT ROWID
                         , ROW_NUMBER() OVER(PARTITION BY JOB_NAME ORDER BY JOB_NAME) DUP 
                          FROM /*RDSUSER.*/A_JOB) 
                    WHERE DUP > 1);
    

    The table may contain:

    ID JOB_NAME
    4 b
    10 c
    12 a

    But if you ROLLBACK and then:

    DELETE FROM /*RDSUSER.*/A_JOB JOB
    WHERE ROWID > (SELECT MIN(ROWID) 
                   FROM A_JOB
                   WHERE JOB.JOB_NAME = A_JOB.JOB_NAME);
    

    Then the output may be:

    ID JOB_NAME
    1 a
    4 b
    6 c

    db<>fiddle here