If I have this schema:
create table orders (
id integer,
primary key(id)
);
create table orderlines (
id integer,
orderid integer,
primary key(id),
foreign key (orderid) references orders(id)
);
and these queries:
select * from orderlines where orderid in (1);
select * from orderlines where orderid in (select orderid from orders where orderid=1);
When I explain
these, I get these results:
QUERY PLAN
------------------------------------------------------------
Seq Scan on orderlines (cost=0.00..38.25 rows=11 width=8)
Filter: (orderid = 1)
(2 rows)
QUERY PLAN
------------------------------------------------------------------------
Seq Scan on orderlines (cost=0.00..47362.65 rows=1130 width=8)
Filter: (SubPlan 1)
SubPlan 1
-> Result (cost=0.00..35.50 rows=2550 width=4)
One-Time Filter: (orderlines.orderid = 1)
-> Seq Scan on orders (cost=0.00..35.50 rows=2550 width=0)
If I understand this correctly, PostgreSQL runs the subselect on each orderlines
record, which is very suboptimal.
I would expect the subselect to be run first, and then inserted into the in()
as a kind of array or so.
Why does the planner do this, and how could I make this construction faster?
In the 2nd query in a subquery you use condition "orderid=1", but there is no column "orderid" in table orders however it exists in orderlines. So it's a correlated query.
They both probably return the same result, but in the 2nd one there is a product join the 1st table on the 2nd table.
If you change orderid in subquery to id, the plan is going to be better: Query:
select * from orderlines where orderid in (select id from orders where id=1);
Plan
Nested Loop Semi Join (cost=0.28..29.80 rows=1 width=8)
-> Seq Scan on orderlines (cost=0.00..21.50 rows=1 width=8)
Filter: (orderid = 1)
-> Index Only Scan using orders_pkey on orders (cost=0.28..8.29 rows=1 width=4)
Index Cond: (id = 1)