We have a table with over 55k rows that have an identifying name duplicated. The name can vary and the number of duplicates with each name can vary. So I applied these 2 scripts for practice deleting duplicate records from a table. Is there a difference? Are there any issues with the scripts? The output seemed the same.
DELETE FROM RDSUSER.A_JOB
WHERE ROWID IN (SELECT ROWID
FROM (SELECT ROWID
, ROW_NUMBER() OVER(PARTITION BY JOB_NAME ORDER BY JOB_NAME) DUP
FROM RDSUSER.A_JOB)
WHERE DUP > 1);
DELETE FROM RDSUSER.A_JOB JOB
WHERE ROWID > (SELECT MIN(ROWID)
FROM A_JOB
WHERE JOB.JOB_NAME = A_JOB.JOB_NAME);
Is there a difference?
Yes.
DELETE FROM RDSUSER.A_JOB
WHERE ROWID IN (SELECT ROWID
FROM (SELECT ROWID
, ROW_NUMBER() OVER(PARTITION BY JOB_NAME ORDER BY JOB_NAME) DUP
FROM RDSUSER.A_JOB)
WHERE DUP > 1);
Will PARTITION BY JOB_NAME
and then ORDER BY JOB_NAME
and since all the JOB_NAME
within the partition are going to be identical then the ORDER BY
clause is non-deterministic and the rows within the partition will be given an effectively random numbering for the rows and it is not guaranteed which rows in the partition will be kept/deleted.
This means that if you run the query and then ROLLBACK
and run the query again then the set of deleted rows may be different the second time (for example, if you run the query on a parallel system).
DELETE FROM RDSUSER.A_JOB JOB
WHERE ROWID > (SELECT MIN(ROWID)
FROM A_JOB
WHERE JOB.JOB_NAME = A_JOB.JOB_NAME);
Will always keep the minimum ROWID
for each JOB_NAME
and the selection of rows to be kept/deleted will be deterministic.
This means that if you run the query and then ROLLBACK
the change and run the delete a second time then an identical set of rows will be deleted.
If you want the queries to function identically then you can use:
DELETE FROM RDSUSER.A_JOB
WHERE ROWID IN (SELECT ROWID
FROM (
SELECT ROW_NUMBER() OVER(PARTITION BY JOB_NAME ORDER BY ROWID)
AS DUP
FROM RDSUSER.A_JOB
)
WHERE DUP > 1);
An example of the random ordering is:
CREATE TABLE a_job (
id NUMBER(5,0) GENERATED ALWAYS AS IDENTITY,
job_name VARCHAR2(20)
);
INSERT INTO a_job (job_name)
SELECT 'a' FROM DUAL CONNECT BY LEVEL <= 3 UNION ALL
SELECT 'b' FROM DUAL CONNECT BY LEVEL <= 2 UNION ALL
SELECT 'c' FROM DUAL CONNECT BY LEVEL <= 5;
INSERT INTO a_job (job_name)
SELECT 'a' FROM DUAL CONNECT BY LEVEL <= 3;
Then after:
DELETE FROM /*RDSUSER.*/A_JOB
WHERE ROWID IN (SELECT ROWID
FROM (SELECT ROWID
, ROW_NUMBER() OVER(PARTITION BY JOB_NAME ORDER BY JOB_NAME) DUP
FROM /*RDSUSER.*/A_JOB)
WHERE DUP > 1);
The table may contain:
ID JOB_NAME 4 b 10 c 12 a
But if you ROLLBACK
and then:
DELETE FROM /*RDSUSER.*/A_JOB JOB
WHERE ROWID > (SELECT MIN(ROWID)
FROM A_JOB
WHERE JOB.JOB_NAME = A_JOB.JOB_NAME);
Then the output may be:
ID JOB_NAME 1 a 4 b 6 c
db<>fiddle here