Search code examples
sqldatabaseoracleperformancetibero

SELECT COUNT query with many left JOIN clause is taking too long


This is the query to count the rows only

SELECT COUNT(*)
FROM SCHEMA.TABLE1 A
LEFT JOIN SCHEMA.TABLE2 B ON B.COL1 = A.COL2
LEFT JOIN SCHEMA.TABLE3 C ON C.COL2 = A.COL3
LEFT JOIN SCHEMA.TABLE4 D ON D.COL3 = A.COL4 
LEFT JOIN SCHEMA.TABLE5 E ON E.COL4 = A.COL5
LEFT JOIN SCHEMA.TABLE6 F ON F.COL5 = A.COL6 

TABLE1 have 3867 rows. TABLE2 have 1000 rows. TABLE3 have 305 rows. TABLE4 have 596 rows. TABLE5 have 3280 rows. TABLE6 have 5000 rows.

And the query took like 13 minutes and 30 seconds to finish and returns roughly 9 billion of rows.

I have already created indexes for all the joined column as well as the record id as it is auto created on table creation as the primary key. Here are all the indexes create query

CREATE INDEX SCHEMA.TABLE1_COL1_IDX ON SCHEMA.TABLE1 (COL1);
CREATE INDEX SCHEMA.TABLE2_COL2_IDX ON SCHEMA.TABLE2 (COL2);
CREATE INDEX SCHEMA.TABLE3_COL3_IDX ON SCHEMA.TABLE3 (COL3);
CREATE INDEX SCHEMA.TABLE4_COL4_IDX ON SCHEMA.TABLE4 (COL4);
CREATE INDEX SCHEMA.TABLE5_COL5_IDX ON SCHEMA.TABLE5 (COL5);
CREATE INDEX SCHEMA.TABLE6_COL6_IDX ON SCHEMA.TABLE6 (COL6);

I have also tried to run all these queries

CALL DBMS_STATS.gather_table_stats('SCHEMA', 'TABLE1');
CALL DBMS_STATS.gather_table_stats('SCHEMA', 'TABLE2');
CALL DBMS_STATS.gather_table_stats('SCHEMA', 'TABLE3');
CALL DBMS_STATS.gather_table_stats('SCHEMA', 'TABLE4');
CALL DBMS_STATS.gather_table_stats('SCHEMA', 'TABLE5');
CALL DBMS_STATS.gather_table_stats('SCHEMA', 'TABLE6');

But after all that, the query still took a lot of time to return the row counts. Is there any way out of this situation or optimization that we may do on this query maybe? Or the query just too much??


Solution

  • This is what happens when your join clauses define a many-to-many join to one or more of your tables. Every join should use a key that is unique on at least one side of the join. If a join key is non-unique on both sides of a join, it will create a partial Cartesian product, which has the potentially to vastly explode the number of rows in your result set. I say "partial" because the execution plan will not show you that there is a Cartesian join, but one is actually happening within each group of join key values. That is clearly what is happening to give you this many rows. In fact if you ever get more result rows than the largest table involved, you are guaranteed to have a many-to-many somewhere.

    A many-to-many which balloons an intermediate resultset in turn causes any subsequent hash joins to be very slow: the amount of I/O to temporary tablespace for building these hash join inputs on such a massive result set can take a very long time. Solve the excessive row problem and you solve the performance problem.

    So, go through each of your tables and test for uniqueness. For example:

    Test first join:

    SELECT KEYCOL1,COUNT(*)
      FROM SCHEMA.BASETABLE
     GROUP BY KEYCOL1
    HAVING COUNT(*) > 1
    
    SELECT KEYCOL1,COUNT(*)
      FROM SCHEMA.TABLE2
     GROUP BY KEYCOL1
    HAVING COUNT(*) > 1
    

    Test second join:

    SELECT KEYCOL2,COUNT(*)
      FROM SCHEMA.BASETABLE
     GROUP BY KEYCOL2
    HAVING COUNT(*) > 1
    
    SELECT KEYCOL2,COUNT(*)
      FROM SCHEMA.TABLE3
      GROUP BY KEYCOL2
    HAVING COUNT(*) > 1
    

    and so forth with each of your joins. If any of these tests gives you results for both queries, you have a many-to-many join. If that's the case with several tables, you have a many-to-many-to-many-to-many set of joins and that can easily produce billions of rows.

    To fix, determine what additional columns are needed in each join to describe uniqueness on one or the other side of the join. Once every one of your joins are one-to-one and only one of them is one-to-many and there are none remaining that are many-to-many, it will resolve your performance issue.

    There is however another way to get a many-to-many join that is less obvious. If you have multiple one-to-many joins, then the product of those multiple joins against themselves is a many-to-many. For example:

    join1:
    rowA-->1
    rowA-->2
    rowA-->3
    

    gives you three rows.

    join2:
    rowA-->M
    rowA-->N
    rowA-->O
    

    also gives you three rows. But there are three rows from join2 for every driving row which already includes 3 rows from join1. So you will end up with 9 rows:

    rowA-->1-->M
    rowA-->1-->N
    rowA-->1-->O
    rowA-->2-->M
    rowA-->2-->N
    rowA-->2-->O
    rowA-->3-->M
    rowA-->3-->N
    rowA-->3-->O
    

    The solution to this is to only have at most one one-to-many join in the same query block. If you need multiple tables with one-to-many relationships in the same query, you probabably need to use inline views (inner query blocks in the FROM clause) that do GROUP BYs on the join keys so that they are unique when joining at the parent block level. Example:

    SELECT A.*,
           B.column_I_need1,
           C.column_I_need2
      FROM SCHEMA.TABLE1 A
           LEFT JOIN (SELECT COL1,
                             MAX(column_I_need) column_I_need1
                        FROM SCHEMA.TABLE1
                       GROUP BY COL1) B ON A.COL1= B.COL1
           LEFT JOIN (SELECT COL2,
                             MAX(column_I_need) column_I_need2
                        FROM SCHEMA.TABLE2
                       GROUP BY COL2) C ON A.COL2= C.COL2