Search code examples
postgresqloptimizationsubquery

Why is using a SELECT inside IN() so expensive?


If I have this schema:

create table orders (
         id integer, 
         primary key(id)
);
create table orderlines (
         id integer, 
         orderid integer, 
         primary key(id), 
         foreign key (orderid) references orders(id)
);

and these queries:

select * from orderlines where orderid in (1);
select * from orderlines where orderid in (select orderid from orders where orderid=1);

When I explain these, I get these results:

                         QUERY PLAN                         
------------------------------------------------------------
 Seq Scan on orderlines  (cost=0.00..38.25 rows=11 width=8)
   Filter: (orderid = 1)
(2 rows)

                               QUERY PLAN                               
------------------------------------------------------------------------
 Seq Scan on orderlines  (cost=0.00..47362.65 rows=1130 width=8)
   Filter: (SubPlan 1)
   SubPlan 1
     ->  Result  (cost=0.00..35.50 rows=2550 width=4)
           One-Time Filter: (orderlines.orderid = 1)
           ->  Seq Scan on orders  (cost=0.00..35.50 rows=2550 width=0)

If I understand this correctly, PostgreSQL runs the subselect on each orderlines record, which is very suboptimal.

I would expect the subselect to be run first, and then inserted into the in() as a kind of array or so.

Why does the planner do this, and how could I make this construction faster?


Solution

  • In the 2nd query in a subquery you use condition "orderid=1", but there is no column "orderid" in table orders however it exists in orderlines. So it's a correlated query.

    They both probably return the same result, but in the 2nd one there is a product join the 1st table on the 2nd table.

    If you change orderid in subquery to id, the plan is going to be better: Query:

    select * from orderlines where orderid in (select id from orders where id=1);
    

    Plan

        Nested Loop Semi Join  (cost=0.28..29.80 rows=1 width=8)
      ->  Seq Scan on orderlines  (cost=0.00..21.50 rows=1 width=8)
            Filter: (orderid = 1)
      ->  Index Only Scan using orders_pkey on orders  (cost=0.28..8.29 rows=1 width=4)
            Index Cond: (id = 1)