I'm working on a DWH supplying script, that deal with ~40M lines on an Oracle Exadata server.
I've got the following tables :
CREATE TABLE P.ARCHIVED_ID
(
ID_ARCHIVED VARCHAR2(31 BYTE) NOT NULL, -- CONSTRAINT UNIQUE / PRIMARY KEY
IS_DELETE CHAR(1 BYTE) DEFAULT null
);
CREATE TABLE P.DWH_tableX
(
ID VARCHAR2(31 BYTE) NOT NULL, -- CONSTRAINT UNIQUE / PRIMARY KEY
FIELD1 VARCHAR2(31 BYTE),
FIELD2 VARCHAR2(31 BYTE),
FIELD3 CHAR(2 BYTE),
FIELD4 CHAR(1 BYTE)
);
As you can see, I've got a IS_DELETE
flag that can be set to Y
(and is NULL by default).
This is the standard query I need to use all around :
SELECT
ID,
FIELD1,
FIELD2,
FIELD3,
FIELD4
FROM P.DWH_tableX A
WHERE EXISTS (
SELECT 1
FROM P.ARCHIVED_ID B
WHERE
A.ID = B.ID_ARCHIVED
AND IS_DELETE = 'Y'
);
Question, is there a better way to optimize this than the following index ?
CREATE BITMAP INDEX P.I_IS_DELETE
ON P.ARCHIVED_ID(IS_DELETE)
LOGGING
TABLESPACE TBS_P_01
NOPARALLEL;
A few more points :
P.ARCHIVED_ID
without WHERE IS_DELETE = 'Y'
(or only once when I update this field at the beginning of the script)NULL
in index, so I use NULL
instead of "N"
, allowing me to make a partial ="Y"
bitmap index, keeping it as small and useful as I can.EXISTS
, instead of IN
, as Oracle recommend (ability to use index),IS_DELETE = 'Y'
is for now 0%, but as it's a DWH, it's expected to only increase (as new lines are create everyday in the data source, and older are physically removed, setting them at IS_DELETE = 'Y'
in the DWH)WHERE EXISTS
is used around 20 time in my script, as it's how we deal to froze the archived values (move them and back on a temporary table, with insert append
hint)With any performance tuning exercise the devil is in the details. The following is some guesswork based on rules of thumb. You must run some performance benchmarks for yourself, using your actual tables and actual data.
"is there a better way to optimize this than the following index"
Almost certainly, yes.
The benefit of bitmap indexes lies in having several of them. That way,when we issue a query filtering on those columns the optimizer can choose to execute a Star Transformation to find the rows in the intersection of the bitmaps. Even then, bitmap indexes on bivalent columns are as useful as columns with several different values.
One bitmap index on its own, particularly one with only two values, isn't much use. Given the monstrous overheads of maintain bitmap indexes, and their concurrency issues, you should probably consider other options.
"~40M lines on an Oracle Exadata server"
Oracle have engineered the Exadata appliance for crunching through large volumes of data. This means looking for paths which support hash joins, Bloom filters and similar operations. With Exadata a common tuning technique is to drop an existing index rather than creating a new one. While cheaper than bitmaps, B-Tree indexes still incur costs resource (CPU, storage, memory) so it's worth considering whether using Exadata's brute force offers a lower cost overall. That's what we pay the big bucks for.
However, even Exadata's raw power is a limited resource. So if you're going to run this query a lot (or rather the EXISTS sub-query) you will likely get a benefit from clustering the excluded rows. From your question it seems IS_DELETE
is an updated attribute you can't use physical organisation at the table level (CTAS, attribute clustering). So a B-tree index on ARCHIVED_ID(IS_DELETE, ID_ARCHIVED)
is the primary candidate.
With compound indexes it's usually best to start with the least selective column, and that is true here. You're only interested in the rows where IS_DELETE = 'Y'
, so leading with that column will reduce the number of blocks the sub-query needs to visit. Leading with ID_ARCHIVED
would mean the sub-query has to scan the whole index. Even with Exadata, we should always seek to minimize the work undertaken to get a set of records.
But please benchmark this, or any other index.
" For the bitmap index, I was thinking it only track Not Null value (so Y), I'm mistaken ?"
'Fraid so. Bitmap indexes have an entry for each indexed field. So, unlike single-column B-tree indexes, they do index null
entries.
"I use EXISTS, instead of IN, as Oracle recommend (ability to use index),"
Hmmm, not sure what Oracle recommendation you think you're citing there. Which is better really depends on the data. But basically if the sub-query is huge and the main query is relatively small then EXISTS is more efficient. But if the data volumes are reversed, the main query is huge and the sub-query is small IN is the better choice. Given your current starting position is the sub-query returns no rows (because nothing is deleted) and presumably relatively few records will be deleted in the near future it seems more likely that IN is the construct you should be using.
But again: benchmark each approach and see what works better with your data.