Search code examples
sqloracleperformancequery-optimizationexadata

Performance optimization with EXISTS


I'm working on a DWH supplying script, that deal with ~40M lines on an Oracle Exadata server.

I've got the following tables :

CREATE TABLE P.ARCHIVED_ID
(
    ID_ARCHIVED  VARCHAR2(31 BYTE) NOT NULL, -- CONSTRAINT UNIQUE / PRIMARY KEY
    IS_DELETE    CHAR(1 BYTE)      DEFAULT null
);

CREATE TABLE P.DWH_tableX
(
    ID      VARCHAR2(31 BYTE) NOT NULL, -- CONSTRAINT UNIQUE / PRIMARY KEY
    FIELD1  VARCHAR2(31 BYTE),
    FIELD2  VARCHAR2(31 BYTE),
    FIELD3  CHAR(2 BYTE),
    FIELD4  CHAR(1 BYTE)
);

As you can see, I've got a IS_DELETE flag that can be set to Y (and is NULL by default).

This is the standard query I need to use all around :

SELECT 
    ID,
    FIELD1,
    FIELD2,
    FIELD3,
    FIELD4
FROM P.DWH_tableX A
WHERE EXISTS (
    SELECT 1 
    FROM P.ARCHIVED_ID B 
    WHERE 
        A.ID = B.ID_ARCHIVED 
        AND IS_DELETE = 'Y'
    );

Question, is there a better way to optimize this than the following index ?

CREATE BITMAP INDEX P.I_IS_DELETE 
    ON P.ARCHIVED_ID(IS_DELETE)
LOGGING
TABLESPACE TBS_P_01
NOPARALLEL;

A few more points :

  • I'll never access P.ARCHIVED_ID without WHERE IS_DELETE = 'Y' (or only once when I update this field at the beginning of the script)
  • by default, Oracle don't register NULL in index, so I use NULL instead of "N", allowing me to make a partial ="Y" bitmap index, keeping it as small and useful as I can. FALSE
  • I use EXISTS, instead of IN, as Oracle recommend (ability to use index),
  • The part with IS_DELETE = 'Y' is for now 0%, but as it's a DWH, it's expected to only increase (as new lines are create everyday in the data source, and older are physically removed, setting them at IS_DELETE = 'Y' in the DWH)
  • the WHERE EXISTS is used around 20 time in my script, as it's how we deal to froze the archived values (move them and back on a temporary table, with insert append hint)

Solution

  • With any performance tuning exercise the devil is in the details. The following is some guesswork based on rules of thumb. You must run some performance benchmarks for yourself, using your actual tables and actual data.

    "is there a better way to optimize this than the following index"

    Almost certainly, yes.

    The benefit of bitmap indexes lies in having several of them. That way,when we issue a query filtering on those columns the optimizer can choose to execute a Star Transformation to find the rows in the intersection of the bitmaps. Even then, bitmap indexes on bivalent columns are as useful as columns with several different values.

    One bitmap index on its own, particularly one with only two values, isn't much use. Given the monstrous overheads of maintain bitmap indexes, and their concurrency issues, you should probably consider other options.

    "~40M lines on an Oracle Exadata server"

    Oracle have engineered the Exadata appliance for crunching through large volumes of data. This means looking for paths which support hash joins, Bloom filters and similar operations. With Exadata a common tuning technique is to drop an existing index rather than creating a new one. While cheaper than bitmaps, B-Tree indexes still incur costs resource (CPU, storage, memory) so it's worth considering whether using Exadata's brute force offers a lower cost overall. That's what we pay the big bucks for.

    However, even Exadata's raw power is a limited resource. So if you're going to run this query a lot (or rather the EXISTS sub-query) you will likely get a benefit from clustering the excluded rows. From your question it seems IS_DELETE is an updated attribute you can't use physical organisation at the table level (CTAS, attribute clustering). So a B-tree index on ARCHIVED_ID(IS_DELETE, ID_ARCHIVED) is the primary candidate.

    With compound indexes it's usually best to start with the least selective column, and that is true here. You're only interested in the rows where IS_DELETE = 'Y', so leading with that column will reduce the number of blocks the sub-query needs to visit. Leading with ID_ARCHIVED would mean the sub-query has to scan the whole index. Even with Exadata, we should always seek to minimize the work undertaken to get a set of records.

    But please benchmark this, or any other index.


    " For the bitmap index, I was thinking it only track Not Null value (so Y), I'm mistaken ?"

    'Fraid so. Bitmap indexes have an entry for each indexed field. So, unlike single-column B-tree indexes, they do index null entries.

    "I use EXISTS, instead of IN, as Oracle recommend (ability to use index),"

    Hmmm, not sure what Oracle recommendation you think you're citing there. Which is better really depends on the data. But basically if the sub-query is huge and the main query is relatively small then EXISTS is more efficient. But if the data volumes are reversed, the main query is huge and the sub-query is small IN is the better choice. Given your current starting position is the sub-query returns no rows (because nothing is deleted) and presumably relatively few records will be deleted in the near future it seems more likely that IN is the construct you should be using.

    But again: benchmark each approach and see what works better with your data.