Suppose we have the following problem:
Given a table with one column 'X', containing some rows with random integers from 1 to 100:
CREATE TABLE xtable(x) AS
SELECT ceil(dbms_random.value * 100)
FROM dual
CONNECT BY level <= 1000000;
We must delete the duplicate rows so that exactly one row remains for each distinct integer in the table.
Let's consider the three solutions (with average execution times and optimizer plans) below.
I must add that experiments show execution times growing roughly as 3 * exp(0.6 * N).
We see that for solution 2 the optimizer plan gives expectations unrelated to the experimental results, and even opposite to them.
In these experiments, the presence or absence of gathered statistics for the table affects neither the optimizer plans nor the execution times.
Please explain why I can't trust the optimizer plan in case 2.
What causes the optimizer to ignore the obvious difference between linear and exponential complexity?
Solutions:
1.
DELETE xtable WHERE rowid IN (
  SELECT ri FROM (
    SELECT rowid AS ri,
           row_number() OVER (PARTITION BY x ORDER BY null) AS rn
    FROM xtable
  )
  WHERE rn > 1
);
Execution time: 14-16 seconds
Plan:
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
------------------------------------------------------------------------------------
| 0 | DELETE STATEMENT | | 1000000 | 15000000 | 5119 | 00:00:01 |
| 1 | DELETE | XTABLE | | | | |
| * 2 | HASH JOIN SEMI | | 1000000 | 15000000 | 5119 | 00:00:01 |
| 3 | TABLE ACCESS FULL | XTABLE | 1000000 | 3000000 | 280 | 00:00:01 |
| 4 | VIEW | VW_NSO_1 | 1000000 | 12000000 | 2976 | 00:00:01 |
| * 5 | VIEW | | 1000000 | 25000000 | 2976 | 00:00:01 |
| 6 | WINDOW SORT | | 1000000 | 3000000 | 2976 | 00:00:01 |
| 7 | TABLE ACCESS FULL | XTABLE | 1000000 | 3000000 | 280 | 00:00:01 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 2 - access(ROWID="RI")
* 5 - filter("RN">1)
2.
DELETE xtable WHERE (x, rowid) NOT IN (SELECT x, min(rowid) FROM xtable GROUP BY x);
Execution time: 15-17 seconds
Plan:
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
--------------------------------------------------------------------------------------
| 0 | DELETE STATEMENT | | 50000 | 150000 | 278162850 | 03:01:06 |
| 1 | DELETE | XTABLE | | | | |
| 2 | FILTER | | | | | |
| 3 | TABLE ACCESS FULL | XTABLE | 1000000 | 3000000 | 281 | 00:00:01 |
| 4 | FILTER | | | | | |
| 5 | SORT GROUP BY NOSORT | | 1000000 | 3000000 | 280 | 00:00:01 |
| 6 | TABLE ACCESS FULL | XTABLE | 1000000 | 3000000 | 280 | 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 5 - access(INTERNAL_FUNCTION("X")=INTERNAL_FUNCTION("X") AND INTERNAL_FUNCTION(ROWID)=INTERNAL_FUNCTION("MIN(ROWID)"))
* 5 - filter(INTERNAL_FUNCTION(ROWID)=INTERNAL_FUNCTION("MIN(ROWID)") AND INTERNAL_FUNCTION("X")=INTERNAL_FUNCTION("X"))
3.
DELETE xtable a WHERE EXISTS (SELECT 1 FROM xtable b WHERE a.x = b.x AND a.rowid < b.rowid);
Execution time: 970-990 seconds
Plan:
----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
----------------------------------------------------------------------------------------------
| 0 | DELETE STATEMENT | | 50000 | 300000 | 278208956 | 03:01:08 |
| 1 | DELETE | XTABLE | | | | |
| * 2 | FILTER | | | | | |
| 3 | NESTED LOOPS SEMI | | 50000 | 300000 | 278208956 | 03:01:08 |
| 4 | TABLE ACCESS FULL | XTABLE | 1000000 | 3000000 | 280 | 00:00:01 |
| * 5 | TABLE ACCESS BY ROWID RANGE | XTABLE | 50000 | 150000 | 278 | 00:00:01 |
----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 2 - filter(:VAR2=:VAR1)
* 5 - access("B".ROWID>"A".ROWID)
Plans were obtained on Oracle 12.1.0.2.0.
Please explain why I can't trust the optimizer plan in case 2.
You should never trust the optimizer. The CBO is 95% right, but you do not know which 5% are wrong.
A typical problem is that the execution plan shown by EXPLAIN PLAN is not the same as the plan actually used at execution time. (You do not say how you obtained the plans.)
When in doubt, use DBMS_SQLTUNE.REPORT_SQL_MONITOR for long-running queries to see the actual plan and its problematic parts.
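For example, a minimal sketch of pulling the actual plan of a monitored statement (the sql_id is a placeholder you would look up in v$sql or v$sql_monitor; note that Real-Time SQL Monitoring requires the Tuning Pack):
SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
         sql_id       => 'abcd1234efgh5',   -- hypothetical sql_id of your DELETE
         type         => 'TEXT',
         report_level => 'ALL'
       ) AS report
FROM dual;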
What causes the optimizer to ignore the obvious difference between linear and exponential complexity?
See above, and forget about comparing the costs of the plans. What you want to avoid while processing a whole table is NESTED LOOPS processing.
This is exactly what happens in case 3:
| 3 | NESTED LOOPS SEMI | | 50000| 300000 | 278208956 | 03:01:08|
| 4 | TABLE ACCESS FULL |XTABLE | 1000000| 3000000 | 280 | 00:00:01|
| 5 | TABLE ACCESS BY ROWID RANGE |XTABLE | 50000| 150000 | 278 | 00:00:01|
You want to see SORT and HASH JOIN operations instead, and this is what plan 1 shows.
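To check which join method was really executed (rather than what EXPLAIN PLAN predicts), one approach is to run the statement with rowsource statistics and then fetch the actual plan from the cursor cache; a minimal sketch, rolling back afterwards to preserve the test data:
-- solution 1, executed with rowsource statistics collection enabled
DELETE /*+ gather_plan_statistics */ xtable WHERE rowid IN (
  SELECT ri FROM (
    SELECT rowid AS ri,
           row_number() OVER (PARTITION BY x ORDER BY null) AS rn
    FROM xtable
  )
  WHERE rn > 1
);

-- actual plan (with real row counts) of the last statement run in this session
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'ALLSTATS LAST'));

ROLLBACK;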
In my opinion, plan 2 will not scale with the number of duplicated records (simply try it with a table in which each row occurs twice and see if you get the same elapsed time as in case 3; see the sketch below). The optimizer cannot estimate the number of duplicated records, so it defensively assumes a high number and therefore a high cost.
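A hedged sketch of that experiment, with xtable2 as a hypothetical test table in which every value occurs exactly twice:
CREATE TABLE xtable2(x) AS
SELECT ceil(level / 2)      -- values 1..500000, each appearing twice
FROM dual
CONNECT BY level <= 1000000;

-- solution 2 against the fully duplicated table
DELETE xtable2 WHERE (x, rowid) NOT IN (SELECT x, min(rowid) FROM xtable2 GROUP BY x);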
Last but one remark: theory says you should not observe linear behaviour, but at best O(n * log(n)).
Last remark: your test data are not realistic for duplicate removal. Typically you have a large table with a small number of duplicates; in your setup, all records except 100 are duplicates.
The cost of the DELETE dominates the cost of finding the duplicates, so you observe linear behaviour.
Try with:
CREATE TABLE xtable(x) AS
SELECT ceil(dbms_random.value * 100000000)
FROM dual
CONNECT BY level <= 1000000;
SELECT count(*) AS total, count(*) - count(DISTINCT x) AS to_be_deleted FROM xtable;
TOTAL TO_BE_DELETED
---------- -------------
1000000 5083
So you will remove about 0.5% of the records. Now scale up and you will observe a completely different pattern.
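For instance, a hypothetical scaled-up run of the same generator (10 million rows over the same value range; generating this many rows via a single CONNECT BY may need substantial memory):
CREATE TABLE xtable_big(x) AS
SELECT ceil(dbms_random.value * 100000000)
FROM dual
CONNECT BY level <= 10000000;   -- 10M rows, same value range as above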