Search code examples
postgresqlquery-performance

Avoid using Nested Loop Join while using a non Equi join condition


Postgres is using a Nested Loop Join algorithm when I use a non equi join condition in my update query. I understand that the Nested Loop Join can be very costly as the right relation is scanned once for every row found in the left relation as per [https://www.postgresql.org/docs/8.3/planner-optimizer.html]

The update query and the execution plan is below.

Query

explain analyze
UPDATE target_tbl tgt
set descr     = stage.descr,
    prod_name = stage.prod_name,
    item_name = stage.item_name,
    url       = stage.url,
    col1_name = stage.col1_name,
    col2_name = stage.col2_name,
    col3_name = stage.col3_name,
    col4_name = stage.col4_name,
    col5_name = stage.col5_name,
    col6_name = stage.col6_name,
    col7_name = stage.col7_name,
    col8_name = stage.col8_name,
    flag      = stage.flag
from tbl1 stage
where tgt.col1 = stage.col1
  and tgt.col2 = stage.col2
  and coalesce(tgt.col3, 'col3'::text) = coalesce(stage.col3, 'col3'::text)
  and coalesce(tgt.col4, 'col4'::text) = coalesce(stage.col4, 'col4'::text)
  and stage.row_number::int >= 1::int
  and stage.row_number::int < 50001::int;

Execution Plan

Update on target_tbl tgt  (cost=0.56..3557.91 rows=1 width=813) (actual time=346153.460..346153.460 rows=0 loops=1)
  ->  Nested Loop  (cost=0.56..3557.91 rows=1 width=813) (actual time=4.326..163876.029 rows=50000 loops=1)
        ->  Seq Scan on tbl1 stage  (cost=0.00..2680.96 rows=102 width=759) (actual time=3.060..2588.745 rows=50000 loops=1)
              Filter: (((row_number)::integer >= 1) AND ((row_number)::integer < 50001))
        ->  Index Scan using tbl_idx on target_tbl tgt  (cost=0.56..8.59 rows=1 width=134) (actual time=3.152..3.212 rows=1 loops=50000)
              Index Cond: ((col1 = stage.col1) AND (col2 = stage.col2) AND (COALESCE(col3, 'col3'::text) = COALESCE(stage.col3, 'col3'::text)) AND (COALESCE(col4, 'col4'::text) = COALESCE(stage.col4, 'col4'::text)))
Planning time: 17.700 ms
Execution time: 346157.168 ms
  1. Is there any way to avoid the nested loop join during the execution of the above query?

  2. Or is there a way that can help me to reduce the cost of the the nested loop scan, currently it takes 6-7 minutes to update just 50000 records?


Solution

  • PostgreSQL can choose a different join strategy in that case. The reason why it doesn't is the gross mis-estimate in the sequential scan: 102 instead of 50000.

    Fix that problem, and things will get better:

    ANALYZE tbl1;
    

    If that is not enough, collect more detailed statistics:

    ALTER TABLE tbl1 ALTER row_number SET STATISTICS 1000;
    ANALYZE tbl1;
    

    All this assumes that row_number is an integer and the type cast is redundant. If you made the mistake to use a different data type, an index is your only hope:

    CREATE INDEX ON tbl1 ((row_number::integer));
    ANALYZE tbl1;