I'm having a Postgresql (version 9.4) performance puzzle. I have a function (prevd
) declared as STABLE
(see below). When I run this function on a constant in where
clause, it is called multiple times - instead of once.
If I understand postgres documentation correctly, the query should be optimized to call prevd
only once.
A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement
Why it doesn't optimize calls to prevd
in this case?
I'm not expecting prevd
to be called once for all subsequent queries using prevd
on the same argument (like it was IMMUTABLE). I'm expecting postgres to create a plan for my query with just one call to prevd('2015-12-12')
Please find the code below:
Schema
create table somedata(d date, number double precision);
create table dates(d date);
insert into dates
select generate_series::date
from generate_series('2015-01-01'::date, '2015-12-31'::date, '1 day');
insert into somedata
select '2015-01-01'::date + (random() * 365 + 1)::integer, random()
from generate_series(1, 100000);
create or replace function prevd(date_ date)
returns date
language sql
stable
as $$
select max(d) from dates where d < date_;
$$
Slow Query
select avg(number) from somedata where d=prevd('2015-12-12');
Poor query plan of the query above
Aggregate (cost=28092.74..28092.75 rows=1 width=8) (actual time=3532.638..3532.638 rows=1 loops=1)
Output: avg(number)
-> Seq Scan on public.somedata (cost=0.00..28091.43 rows=525 width=8) (actual time=10.210..3532.576 rows=282 loops=1)
Output: d, number
Filter: (somedata.d = prevd('2015-12-12'::date))
Rows Removed by Filter: 99718
Planning time: 1.144 ms
Execution time: 3532.688 ms
(8 rows)
Performance
The query above, on my machine runs around 3.5s. After changing prevd
to IMMUTABLE, it's changing to 0.035s.
I started writing this as a comment, but it got a bit long, so I'm expanding it into an answer.
As discussed in this previous answer, Postgres does not promise to always optimise based on STABLE
or IMMUTABLE
annotations, only that it can sometimes do so. It does this by planning the query differently by taking advantage of certain assumptions. This part of the previous answer is directly analogous to your case:
This particular sort of rewriting depends upon immutability or stability. With
where test_multi_calls1(30) != num
query re-writing will happen forimmutable
but not for merelystable
functions.
If you change the function to IMMUTABLE
and look at the query plan, you will see that the rewriting it does is really rather radical:
Seq Scan on public.somedata (cost=0.00..1791.00 rows=272 width=12) (actual time=0.036..14.549 rows=270 loops=1)
Output: d, number
Filter: (somedata.d = '2015-12-11'::date)
Buffers: shared read=541 written=14
Total runtime: 14.589 ms
It actually runs the function while planning the query, and substitutes the value before the query is even executed. With a STABLE
function, this optimisation would clearly not be appropriate - the data might change between planning and executing the query.
In a comment, it was mentioned that this query results in an optimised plan:
select avg(number) from somedata where d=(select prevd(date '2015-12-12'));
This is fast, but note that the plan doesn't look anything like what the IMMUTABLE
version did:
Aggregate (cost=1791.69..1791.70 rows=1 width=8) (actual time=14.670..14.670 rows=1 loops=1)
Output: avg(number)
Buffers: shared read=541 written=21
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
Output: '2015-12-11'::date
-> Seq Scan on public.somedata (cost=0.00..1791.00 rows=273 width=8) (actual time=0.026..14.589 rows=270 loops=1)
Output: d, number
Filter: (somedata.d = $0)
Buffers: shared read=541 written=21
Total runtime: 14.707 ms
By putting it into a sub-query, you are moving the function call from the WHERE clause to the SELECT clause. More importantly, the sub-query can always be executed once and used by the rest of the query; so the function is run once in a separate node of the plan.
To confirm this, we can take the SQL out of a function altogether:
select avg(number) from somedata where d=(select max(d) from dates where d < '2015-12-12');
This gives a rather longer plan with very similar performance:
Aggregate (cost=1799.12..1799.13 rows=1 width=8) (actual time=14.174..14.174 rows=1 loops=1)
Output: avg(somedata.number)
Buffers: shared read=543 written=19
InitPlan 1 (returns $0)
-> Aggregate (cost=7.43..7.44 rows=1 width=4) (actual time=0.150..0.150 rows=1 loops=1)
Output: max(dates.d)
Buffers: shared read=2
-> Seq Scan on public.dates (cost=0.00..6.56 rows=347 width=4) (actual time=0.015..0.103 rows=345 loops=1)
Output: dates.d
Filter: (dates.d < '2015-12-12'::date)
Buffers: shared read=2
-> Seq Scan on public.somedata (cost=0.00..1791.00 rows=273 width=8) (actual time=0.190..14.098 rows=270 loops=1)
Output: somedata.d, somedata.number
Filter: (somedata.d = $0)
Buffers: shared read=543 written=19
Total runtime: 14.232 ms
The important thing to note is that the inner Aggregate (the max(d)
) is executed once, on a separate node from the main Seq Scan (which is checking the where
clause). In this position, even a VOLATILE
function can be optimised in the same way.
In short, while you know that the query you've produced can be optimised by executing the function only once, it doesn't match any of the patterns that Postgres's query planner knows how to rewrite, so it uses a naive plan which runs the function multiple times.
[Note: all tests performed on Postgres 9.1, because it's what I happened to have to hand.]