Search code examples
postgresqlsql-execution-planexplain

Postgresql huge performance difference when using IN vs NOT IN


I have 2 tables, "transaksi" and "buku". "transaksi" has around ~250k rows, and buku has around ~170k rows. Both tables have column called "k999a", and both tables use no indexes. Now I check these 2 statements.

Statement 1:

explain select k999a from transaksi where k999a not in (select k999a from buku);

Statement 1 outputs:

 Seq Scan on transaksi  (cost=0.00..721109017.46 rows=125426 width=9)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=0.00..5321.60 rows=171040 width=8)
           ->  Seq Scan on buku  (cost=0.00..3797.40 rows=171040 width=8)

Statement 2:

explain select k999a from transaksi where k999a in (select k999a from buku);

Statement 2 outputs:

Hash Semi Join  (cost=6604.40..22664.82 rows=250853 width=9)
   Hash Cond: (transaksi.k999a = buku.k999a)
   ->  Seq Scan on transaksi  (cost=0.00..6356.53 rows=250853 width=9)
   ->  Hash  (cost=3797.40..3797.40 rows=171040 width=8)
         ->  Seq Scan on buku  (cost=0.00..3797.40 rows=171040 width=8)

Why in the NOT IN query, postgresql does loop join, making the query takes a long time?

PS: postgresql version 9.6.1 on windows 10


Solution

  • This is to be expected. You may get better performance using WHERE NOT EXISTS instead:

    SELECT k999a
    FROM transaksi
    WHERE NOT EXISTS (
        SELECT 1 FROM buku WHERE buku.k999a = transaksi.k999a LIMIT 1
    );
    

    Here is a good explanation as to why for each of the methods: https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/