Explain plan: select from 100K records table by indexed varchar2 field is not using the index

I agree with Tom Kyte about full table scan is not an evil source, but only when a table is relatively small. So, having additional index of such a table is redundant. However, a table with 100.000 records should not be considered as small but explain plan from such a table shows performed table full scan. So, I did small experiment on my laptop with Oracle installed locally:

1) first, created my_table:

CREATE TABLE my_table(
  "ID" NUMBER NOT NULL ENABLE, 
  "INVOICE_NO" VARCHAR2(10), 
  CONSTRAINT "test _PK" PRIMARY KEY ("ID")
)

2) Then, created index for invoice_no column (because will filter using it):

CREATE INDEX "my_table_index1" ON my_table (invoice_no)

3) Then, inserted 100K records:

DECLARE
  mod_val NUMBER;
BEGIN
  FOR i IN 1..100000 LOOP
    mod_val := MOD(i,6);
    IF (mod_val = 0) THEN
      INSERT INTO my_table (ID, INVOICE_NO) VALUES (i, '5570-110');
    ELSIF (mod_val = 1) THEN
      INSERT INTO my_table (ID, INVOICE_NO) VALUES (i, '5570-111');
    ELSIF (mod_val = 2) THEN
      INSERT INTO my_table (ID, INVOICE_NO) VALUES (i, '5570-112');
    ELSIF (mod_val = 3) THEN
      INSERT INTO my_table (ID, INVOICE_NO) VALUES (i, '5570-113');
    ELSIF (mod_val = 4) THEN
      INSERT INTO my_table (ID, INVOICE_NO) VALUES (i, '5570-114');
    ELSIF (mod_val = 5) THEN
      INSERT INTO my_table (ID, INVOICE_NO) VALUES (i, '5570-115');
    END IF; 
  END LOOP;
  COMMIT;
END;

4) Then updated one random record (just for emphasizing the select):

BEGIN
  UPDATE my_table SET INVOICENO = 'exception' WHERE id = 50000;
  COMMIT;
END;

5) Then performed select with explain plan:

EXPLAIN PLAN FOR
  SELECT * FROM my_table WHERE invoice_no = 'exception';

6) Then grabbed the statistics:

 SELECT * FROM TABLE(dbms_xplan.display);

7) and got the results:

"PLAN_TABLE_OUTPUT"
"Plan hash value: 3804444429"
" "
"------------------------------------------------------------------------------"
"| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |"
"------------------------------------------------------------------------------"
"|   0 | SELECT STATEMENT  |          | 83256 |  1626K|   103   (1)| 00:00:02 |"
"|   1 |  TABLE ACCESS FULL| MY_TABLE | 83256 |  1626K|   103   (1)| 00:00:02 |"
"------------------------------------------------------------------------------"
" "
"Note"
"-----"
"   - dynamic sampling used for this statement (level=2)"

Conclusion: It's strange and smells "magic", why Oracle decided not to use the index on invoice_no field and scanned 83256 records? I agree my laptop is not overloaded with concurrent users, a table is not quite big in size (contain numbers and varchars), however, I don't like this magic and would like to know the reasons for such behaviour :)

UPDATE: I just added some dummy value (see below) to invoice_no field to all records - just to increase the size of the table, however, table full scan remains: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

UPDATE2: I also executed analyze table but the result is the same:

ANALYZE TABLE my_table COMPUTE STATISTICS;

UPDATE3: tried to force to use index but the result is the same (maybe wrong syntax?):

EXPLAIN PLAN FOR 
  SELECT /*+ INDEX(my_table my_table_index1) */ * FROM my_table t WHERE invoice_no = 'exception'

UPDATE4: Finally, was able "to tell Oracle" to use the index - executed the new gather table stats procedure:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS ( OWNNAME=>user
                             , TABNAME=>'my_table');
END;

Here's the output of explain plan:

"--------------------------------------------------------------------------------------"
"| Id  | Operation                   | Name            | Rows  | Bytes | Cost (%CPU)| Time     |"
"-----------------------------------------------------------------------------------------------"
"|   0 | SELECT STATEMENT            |                 |     1 |   294 |     5   (0)| 00:00:01 |"
"|   1 |  TABLE ACCESS BY INDEX ROWID| MY_TABLE        |     1 |   294 |     5   (0)| 00:00:01 |"
"|*  2 |   INDEX RANGE SCAN          | my_table_index1 |     1 |       |     4   (0)| 00:00:01 |"
"-----------------------------------------------------------------------------------------------"
" "
"Predicate Information (identified by operation id):"
"---------------------------------------------------"
" "
"   2 - access(""INVOICE_NO""='exception')"

So, it appears Oracle decides to use some querying approach at some time point, and is not updating it even if situation has changed. I agree with that but it's strange why it didn't selected the right approach for this test case when I just created, inserted and performed the select. Do we have always to execute DBMS_STATS.GATHER_TABLE_STATS at least at start to tell Oracle to use the best querying approach?

Solution

When you initially create the table, there are only 7 distinct values for INVOICE_NO. By default, therefore, Oracle expects that a query against the table which just specifies a predicate on INVOICE_NO will return roughly 1 out of every 7 rows (~14.3% of the rows) which is generally going to mean that a table scan would be more efficient than an index scan (the exact cutoff point will depend on a number of different parameters-- it is entirely possible that some systems would pick an index scan if it expects to retrieve 15% of the rows).

When you ran the query initially, there were no statistics on the table so Oracle was forced to do dynamic sampling (note the comment "dynamic sampling used for this statement (level=2)" in the query plan). That is designed to very quickly gather some basic statistics for the optimizer. However, dynamic sampling is designed to optimize for speed rather than accuracy so the quality of the statistics is generally less than optimal. In your first example, Oracle estimates that the query is returning 83256 rows (83.2% of the total) which probably means both that it overestimated the number of rows in the table and underestimated the number of distinct values in the INVOICE_NO column.

Had you gathered stats using

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS ( OWNNAME=>user
                             , TABNAME=>'my_table');
END;

after step 4 but before step 5, assuming you haven't changed any of the DBMS_STATS default settings, you would have had better statistics but you would still (most likely) have a table scan. Oracle would estimate that 14286 rows (1 out of 7 rows).

SQL> SELECT * FROM TABLE(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 3804444429

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          | 14286 |   195K|   104   (2)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| MY_TABLE | 14286 |   195K|   104   (2)| 00:00:02 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------

   1 - filter("INVOICE_NO"='exception')

In order to get a better plan, you need to have a histogram on the INVOICE_NO column. This will tell Oracle that the data in the INVOIE_NO is not evenly distributed so some values (i.e. "exception") are much more selective than other columns. When you are gathering statistics, you can specify that you want to gather histograms on individual columns, on all indexed columns, or you can specify that you want Oracle to automatically determine which columns need histograms (we'll get back to that in a moment). If you want to force Oracle to gather histogramson all the indexed columns,

SQL> exec dbms_stats.gather_table_stats( 'SCOTT', 
                                         'MY_TABLE', 
                                          method_opt => 'FOR ALL INDEXED COLUMNS SIZE 254' );

PL/SQL procedure successfully completed.

Assuming that there are 255 or fewer distinct values for INVOICE_NO, this histogram will let Oracle track exactly how common each distinct value is (if there are more than 255 distinct values, then Oracle will need to combine adjacent values which may make your histograms less accurate).

In a default Oracle 10.2 or 11.2 install, the default method_opt setting is going to be "FOR ALL COLUMNS SIZE AUTO". This tells Oracle to gather histograms on whatever columns it determines would be appropriate. To do this, Oracle looks for columns where the distribution of data is highly skewed and where that column appears in predicates. So earlier, when I was talking about gathering statistics between steps 4 and 5, Oracle didn't gather a histogram on INVOICE_NO because while it knew that the data was skewed, it did not know that you were going to query the table based on that column.

After step 7, if you gathered statistics again using exactly the same command

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS ( OWNNAME=>user
                             , TABNAME=>'my_table');
END;

Oracle would then see that there was a query against MY_TABLE that had a predicate on INVOICE_ID in the shared pool. That would allow it to realize that INVOICE_NO met both conditions to get a histogram so this time, it would gather a histogram on INVOICE_NO. This allows the optimizer to realize that your query is only returning 1 row and to realize that an index scan will be the most efficient plan

SQL> SELECT * FROM TABLE(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3377519735

-----------------------------------------------------------------------------------------------
| Id  | Operation                   | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                 |     1 |    14 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| MY_TABLE        |     1 |    14 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | my_table_index1 |     1 |       |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("INVOICE_NO"='exception')

So, the good news is that Oracle is smart enough to eventually figure out that it needs a histogram in order to produce the best plan for this query. The bad news is that if you don't tell Oracle this by gathering statistics including histograms where appropriate when you populate the data in a table, you are likely to get poor plans until Oracle figures out what it needs.

In a real system, you will generally be using bind variables rather than literals in the vast majority of your queries. When you use bind variables in queries against columns with histograms, you introduce a new set of issues. If you have a query in your application

 SELECT * 
   FROM my_table 
  WHERE invoice_no = :1;

you would want a table scan if you bound a value "5570-110" but you would want an index scan if you bound a value of "exception". In Oracle 10.2, Oracle will do bind variable peeking which means that when Oracle does a hard parse, it will peek at the value of the bind variable and generate a plan that optimizes for that bind value. Unfortunately, in 10g, you can only have one plan per query so you'll only get an optimal plan for one of the two cases at a time and which plan you get will depend on the luck of which bind value was encountered first. In 11g, you can get adaptive cursor sharing where Oracle maintains multiple query plans for different bind variable values though this introduces some extra complexity that you'll need to be aware of.

Oh, and as an aside, your hint didn't work because you used a case-sensitive name for your index. Your hint would need to use the case-sensitive index name. You would also need to use the alias name rather than the table name

SELECT /*+ INDEX(t "my_table_index1") */ * 
  FROM my_table t 
 WHERE invoice_no = 'exception'

This is one of the (many) reasons that using case-sensitive identifiers is generally a major pain.