Search code examples
sqlpostgresqlquery-performancepsqlodbc

How to optimize the execution time for this query


I have the following query:

SELECT "factures"."id" 
FROM   "factures" 
WHERE  ( "factures"."id" NOT IN (SELECT DISTINCT( "echeances"."facture_id" ) 
                                 FROM   "echeances" 
                                 WHERE  "echeances"."type_decheance" IN ( 2, 3, 4, 5, 8, 9 ) 
                                        AND "echeances"."facture_id" IS NOT NULL 
                                 LIMIT  100000)) <----- removing this limit makes the query take enormous time
ORDER  BY "factures"."id" DESC

Here is the explain analysis with limit 100 000 :

Index Only Scan Backward using factures_id_pkey on factures  (cost=93516.76..211292.17 rows=530570 width=4) (actual time=1425.701..11466.759 rows=963698 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 99997
  Heap Fetches: 1063695
  SubPlan 1
    ->  Limit  (cost=0.43..93266.34 rows=100000 width=4) (actual time=0.022..1229.925 rows=100000 loops=1)
          ->  Unique  (cost=0.43..264837.37 rows=283959 width=4) (actual time=0.022..1090.692 rows=100000 loops=1)
                ->  Index Scan using echeances__facture_id__idx on echeances  (cost=0.43..262883.29 rows=781631 width=4) (actual time=0.020..819.735 rows=100167 loops=1)
                      Index Cond: (facture_id IS NOT NULL)
"                      Filter: (type_decheance = ANY ('{2,3,4,5,8,9}'::integer[]))"
                      Rows Removed by Filter: 156995
Planning time: 0.249 ms
Execution time: 11960.423 ms

Here is the explain without limit:

->  Unique  (cost=0.43..264837.37 rows=283959 width=4)
Index Only Scan Backward using factures_id_pkey on factures  (cost=0.86..142233669403.15 rows=530570 width=4)
  Filter: (NOT (SubPlan 1))
  SubPlan 1
    ->  Materialize  (cost=0.43..267367.16 rows=283959 width=4)
                ->  Index Scan using echeances__facture_id__idx on echeances  (cost=0.43..262883.29 rows=781631 width=4)
                      Index Cond: (facture_id IS NOT NULL)
"                      Filter: (type_decheance = ANY ('{2,3,4,5,8,9}'::integer[]))"

And here is the schema

Table "factures"
id

Table "echeances"
id
facture_id (fk)
type_decheance (integer)

The issue is that "factures" and "echeances" table have a huge amount of rows, if :

  • Limit is specified in the subquery, for example limit 100000, query time is fast

  • Limit is not specified in the subquery, it takes a lot of time, I had to stop it after waiting more than 15 minutes.

The goal is to have this query run without limit and in a reasonable amount of time.


Solution

  • Switch to NOT EXISTS:

    SELECT f.id
    FROM factures f
    WHERE NOT EXISTS (SELECT 1
                      FROM echeances e
                      WHERE e.facture_id = f.id AND
                            e.type_decheance IN ( 2, 3, 4, 5, 8, 9 ) 
                     )
    ORDER  BY f.id DESC;
    

    Note that I removed all the double quoting. Don't quote your identifiers. It just makes queries harder to write and to read.

    Then you want an index on echeances(facture_id, type_decheance). And that should be pretty fast because each facture id can be checked with a simple index lookup.