
What kind of indexes should be created on a voluminous table used in a join in a SQL query?


For the following query, ITEM_CTY_TAR_T is a voluminous table with 10 million+ rows. I have created a primary key index on this table on the columns ITEM_NO, ITEM_TYPE, GA_CODE_IMP, GA_TYPE_IMP, TAR_NO and VALID_DATE_FROM, but when I explain the plan for this query, the optimizer does a FULL scan of the table. What index should be created to prevent the full scan?

SELECT
    t.item_no, t.item_type, t.BU_CODE_RU, t.tar_no,
    VALID_DATE_FROM 
FROM
    (SELECT  
         tar.item_no, tar.item_type,
         org.BU_CODE_RU, tar.tar_no, tar.VALID_DATE_FROM,
         ROW_NUMBER() OVER (PARTITION BY tar.ITEM_NO, tar.ITEM_TYPE, org.BU_CODE_RU 
                            ORDER BY VALID_DATE_FROM DESC, tar.tar_no DESC) AS RowNumber  
     FROM 
         (SELECT 
              item_no, item_type, tar_no, valid_date_from,
              GA_CODE_IMP 
          FROM 
              ITEM_CTY_TAR_T tar 
          WHERE  
              GA_TYPE_IMP = 'CTY' 
              AND DATE '2023-08-06' >= TRUNC(tar.VALID_DATE_FROM)
              AND DATE '2023-08-06' <= NVL(trunc(tar.VALID_DATE_TO), DATE'9999-12-31')
              AND NVL(trunc(tar.DELETE_DATE), DATE '9999-12-31') >  DATE '2023-08-06') tar,
         (SELECT
              BU_CODE_RU, GA_CODE_CTY 
          FROM
              store_t 
          WHERE
              BU_TYPE = 'SO' AND BU_CODE_RU IS NOT NULL) org
     WHERE
         org.GA_CODE_CTY = tar.GA_CODE_IMP) t  
WHERE
    rownumber = 1



Solution

  • You don't want indexes to be used for this query. Indexes are only faster than full table scans when the ratio of target rows to the table size is extremely small. That is not likely to be the case here.

    You're looking for records active as of a given day, but in most type-2 tables that's a significant portion, if not a majority, of the table. Neither GA_TYPE_IMP = 'CTY' nor BU_TYPE = 'SO' is likely to narrow down your results enough to justify index use. A full table scan is optimal in this situation.
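
    If you want to verify that, count how many rows actually satisfy the filter; if it's more than a small percentage of the 10 million+ rows, the full scan is the right plan. A sketch, reusing the table and predicates from the question:

    SELECT COUNT(*) AS qualifying_rows
    FROM   ITEM_CTY_TAR_T tar
    WHERE  tar.GA_TYPE_IMP = 'CTY'
    AND    DATE '2023-08-06' >= TRUNC(tar.VALID_DATE_FROM)
    AND    DATE '2023-08-06' <= NVL(TRUNC(tar.VALID_DATE_TO), DATE '9999-12-31')
    AND    NVL(TRUNC(tar.DELETE_DATE), DATE '9999-12-31') > DATE '2023-08-06';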

    I suggest that instead of looking at indexes, you hint the query for a modest amount of parallelism:

    SELECT /*+ parallel(4) */
        t.item_no, t. . . .
    

    Now suppose your records undergo constant change: a typical row version lasts only a day or so, you have thousands of versions per key, so asking for records active as of Aug 6 2023 returns only a tiny fraction of the table (much less than 1%), and you typically query only a very recent date (like today or yesterday). Then those from/to dates become selective and may push you toward an index. If that's the case, your best bet is a combined index, though it won't perform especially well: it can seek to the right starting point on the leading column (VALID_DATE_FROM) but must then scan the remainder of the index to test the other column (VALID_DATE_TO):

    CREATE INDEX i_index ON ITEM_CTY_TAR_T (VALID_DATE_FROM, VALID_DATE_TO);
    

    And to use it at all you'd have to remove your date functions on at least the leading column. Shifting the boundary by one day preserves the meaning, since TRUNC(VALID_DATE_FROM) <= DATE '2023-08-06' is equivalent to VALID_DATE_FROM < DATE '2023-08-06' + 1:

     AND DATE '2023-08-06' + 1 > tar.VALID_DATE_FROM
     AND DATE '2023-08-06' < NVL(tar.VALID_DATE_TO, DATE'9999-12-31')
    

    Because your predicate on the first indexed column is an inequality, there's no hope of seeking on the second column, so leaving a function there doesn't really hurt. But scanning the remainder of the index might still be cheaper than scanning the full table if you are only looking at recent dates: the index is narrower than the table, and only its tail end needs to be read.

    Index scanning, however, uses inefficient single-block reads that follow a linked list of non-contiguous blocks, and every row that matches the predicates costs another single-block read from the table segment. That can really add up. A full table scan, by contrast, uses much more efficient multiblock reads in a single pass, can bypass the buffer cache, and can use parallelism, so it is far better at reading large volumes of data. Which approach wins can only be known by testing in your environment with your data.
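
    The only reliable way to compare the two is to look at the plans and actual timings. A sketch using Oracle's standard EXPLAIN PLAN and DBMS_XPLAN tooling; the index hint assumes the i_index name from the CREATE INDEX above:

    -- Explain the inner date-filtered query with the index forced via a hint:
    EXPLAIN PLAN FOR
    SELECT /*+ index(tar i_index) */
           item_no, item_type, tar_no, valid_date_from, GA_CODE_IMP
    FROM   ITEM_CTY_TAR_T tar
    WHERE  DATE '2023-08-06' + 1 > tar.VALID_DATE_FROM
    AND    DATE '2023-08-06' < NVL(tar.VALID_DATE_TO, DATE '9999-12-31');

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

    Then repeat without the hint (letting the optimizer pick the full scan, optionally with /*+ parallel(4) */) and compare the plans and, more importantly, the actual run times.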