Search code examples
sqloracle

Optimizing sql query with subselect list in clause


I'm using oracle 11g and trying to optimize a query.

The basic structure of the query is:

SELECT val1, val2, val3,
FROM 
table_name
WHERE
val1 in (subselect statement is here, it selects a list of possible values for 
    val1 from another table) 
and val5>=X and val5<=Y
group by val1
order by val2 desc;

My issue is that when I use a subselect, the cost is 3130. If I fill in the results of the subselect by hand - so, for example

field1 in (1, 2, 3, 4, 5, 6) 

Where (1, 2, 3, 4, 5, 6) is the results of the subselect, which in this case is all possible values of field 1, the cost of the query is 14, and oracle uses an "inlist iterator" for the group by part of the query. The results of the two queries are identical.

My question is how to mimic the behaviour of manually listing the possible values of field1 with a subselect statement. The reason I don't list those values in the query is that the possible values change based on one of the other fields, so the subselect is pulling the possible values of field1 from a 2nd table based on, say, field2.

I have an index of val1, val5, so it isn't doing any full table scans - it does do a range scan in both cases, but in the subselect case the range scan is much more expensive. However it isn't the most expensive part of the subselect query. The most expensive part is the group by, which is a HASH.

Edit - Yes, the query isn't syntactically correct - I didn't want to put up anything too specific. The actual query is fine - the selects use valid group by functions.

The subselect returns 6 values, but it can be anywhere from 1-50 or so based on the other value.

Edit2 - What I ended up doing was 2 separate queries so I could generate the list used in the subselect. I actually tried a similar test in sqlite, and it does the same thing, so this isn't just Oracle.


Solution

  • what you are seeing is a result of the IN () bieng subject to bind variable peeking. when you have histograms you write a query like "where a = 'a'" oracle will use the histogram to guess how many rows will be returned (same idea with an inlist operator, which iterates for each item and aggregates rows). if no histograms it will make a guess in the form of rows/distinct values. In a subquery oracle doesn't do this (in most cases..there is a unique case where it does).

    for example:

    SQL> create table test
      2  (val1 number, val2 varchar2(20), val3 number);
    
    Table created.
    
    Elapsed: 00:00:00.02
    SQL>
    SQL> insert into test select 1, 'aaaaaaaaaa', mod(rownum, 5) from dual connect by level <= 100;
    
    100 rows created.
    
    Elapsed: 00:00:00.01
    SQL> insert into test select 2, 'aaaaaaaaaa', mod(rownum, 5) from dual connect by level <= 1000;
    
    1000 rows created.
    
    Elapsed: 00:00:00.02
    SQL> insert into test select 3, 'aaaaaaaaaa', mod(rownum, 5) from dual connect by level <= 100;
    
    100 rows created.
    
    Elapsed: 00:00:00.00
    SQL> insert into test select 4, 'aaaaaaaaaa', mod(rownum, 5) from dual connect by level <= 100000;
    
    100000 rows created.
    

    so i have a table with 101200 rows. for VAL1 , 100 are "1" 1000 are "2" 100 are "3" and 100k are "4".

    now if histograms are gathered (and we do want them in this case)

    SQL> exec dbms_stats.gather_table_stats(user , 'test', degree=>4, method_opt=>'for all indexed columns size 4', estimate_percent=>100);
    
    SQL> exec dbms_stats.gather_table_stats(user , 'lookup', degree=>4, method_opt =>'for all indexed columns size 3', estimate_percent=>100);
    

    we see the following:

    SQL> explain plan for select * from test where val1 in (1, 2, 3) ;
    
    Explained.
    
    SQL> @explain ""
    
    Plan hash value: 3165434153
    
    --------------------------------------------------------------------------------------
    | Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT             |       |  1200 | 19200 |    23   (0)| 00:00:01 |
    |   1 |  INLIST ITERATOR             |       |       |       |            |          |
    |   2 |   TABLE ACCESS BY INDEX ROWID| TEST  |  1200 | 19200 |    23   (0)| 00:00:01 |
    |*  3 |    INDEX RANGE SCAN          | TEST1 |  1200 |       |     4   (0)| 00:00:01 |
    --------------------------------------------------------------------------------------
    

    vs

    SQL> explain plan for select * from test where val1 in (select id from lookup where str = 'A') ;
    
    Explained.
    
    SQL> @explain ""
    
    Plan hash value: 441162525
    
    ----------------------------------------------------------------------------------------
    | Id  | Operation                    | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
    ----------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT             |         | 25300 |   518K|   106   (3)| 00:00:02 |
    |   1 |  NESTED LOOPS                |         | 25300 |   518K|   106   (3)| 00:00:02 |
    |   2 |   TABLE ACCESS BY INDEX ROWID| LOOKUP  |     1 |     5 |     1   (0)| 00:00:01 |
    |*  3 |    INDEX UNIQUE SCAN         | LOOKUP1 |     1 |       |     0   (0)| 00:00:01 |
    |*  4 |   TABLE ACCESS FULL          | TEST    | 25300 |   395K|   105   (3)| 00:00:02 |
    ----------------------------------------------------------------------------------------
    

    where lookup table is

    SQL> select * From lookup;
    
            ID STR
    ---------- ----------
             1 A
             2 B
             3 C
             4 D
    

    (str is unique indexed and has histograms).

    notice a bang on cardinality of 1200 for the inlist and a good plan, but a wildly inaccurate one on the sub query? Oracle hasn't computed histograms on the join condition, instead it has said "look, i dont know what id will be, so ill guess total rows(100k+1000+100+100)/distinct values(4) = 25300 and use that. as such its picked a full table scan.

    that's all great, but how to fix it? if you know that this sub query will match a small number of rows (we do). then you have to hint the outer query to try to have it use an index. like:

    SQL> explain plan for select /*+ index(t) */ * from test t where val1 in (select id from lookup where str = 'A') ;
    
    Explained.
    
    SQL> @explain
    
    Plan hash value: 702117913
    
    ----------------------------------------------------------------------------------------
    | Id  | Operation                    | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
    ----------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT             |         | 25300 |   518K|   456   (1)| 00:00:06 |
    |   1 |  NESTED LOOPS                |         | 25300 |   518K|   456   (1)| 00:00:06 |
    |   2 |   TABLE ACCESS BY INDEX ROWID| LOOKUP  |     1 |     5 |     1   (0)| 00:00:01 |
    |*  3 |    INDEX UNIQUE SCAN         | LOOKUP1 |     1 |       |     0   (0)| 00:00:01 |
    |   4 |   TABLE ACCESS BY INDEX ROWID| TEST    | 25300 |   395K|   455   (1)| 00:00:06 |
    |*  5 |    INDEX RANGE SCAN          | TEST1   | 25300 |       |    61   (2)| 00:00:01 |
    ----------------------------------------------------------------------------------------
    

    another thing is in my particular case. as val1=4 is most of the table, lets say i have my standard query: select * from test t where val1 in (select id from lookup where str = :B1);

    for the possible :B1 inputs. if i know that the valid values passed in are A, B and C (ie not D which maps to id=4) . i can add this trick:

    SQL> explain plan for select  * from test t where val1 in (select id from lookup where str = :b1 and id in (1, 2, 3)) ;
    
    Explained.
    
    SQL> @explain ""
    
    Plan hash value: 771376936
    
    --------------------------------------------------------------------------------------------------
    | Id  | Operation                     | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT              |                  |   250 |  5250 |    24   (5)| 00:00:01 |
    |*  1 |  HASH JOIN                    |                  |   250 |  5250 |    24   (5)| 00:00:01 |
    |*  2 |   VIEW                        | index$_join$_002 |     1 |     5 |     1 (100)| 00:00:01 |
    |*  3 |    HASH JOIN                  |                  |       |       |            |          |
    |*  4 |     INDEX RANGE SCAN          | LOOKUP1          |     1 |     5 |     0   (0)| 00:00:01 |
    |   5 |     INLIST ITERATOR           |                  |       |       |            |          |
    |*  6 |      INDEX UNIQUE SCAN        | SYS_C002917051   |     1 |     5 |     0   (0)| 00:00:01 |
    |   7 |   INLIST ITERATOR             |                  |       |       |            |          |
    |   8 |    TABLE ACCESS BY INDEX ROWID| TEST             |  1200 | 19200 |    23   (0)| 00:00:01 |
    |*  9 |     INDEX RANGE SCAN          | TEST1            |  1200 |       |     4   (0)| 00:00:01 |
    --------------------------------------------------------------------------------------------------
    

    now notice oracle has got a reasonable card (its pushed the 1,2,3 onto the TEST table and got 1200..not 100% accurate, as i was only filtering on noe of them but ive told oralce CERTAINLY NOT 4!