I want to cache a variable during a query on Postgres 12. I followed the approach of using a CTE like below:
-- BEGIN PART 1
with cached_vars as (
select max(datetime) as datetime_threshold
from locations
where distance > 70
and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
and datetime > (select datetime_threshold from cached_vars)
-- END PART 2
Running the above query leads to performance issues. I expected the total runtime to be approximately (part 1 runtime + part 2 runtime), but it takes much longer.
Notably, there is no performance issue when I run only the 2nd part with a manual datetime_threshold.
The locations table is defined as:
id | user_id | datetime | location | distance | ...
-----------------------------------------------------
Is there any way to reduce the total runtime to something like (part 1 runtime + part 2 runtime)?
The explanation behind the difference you observed is this:
Postgres has column statistics and can adapt the query plan depending on the value of a provided constant for datetime_threshold. With favorable filter values, this can lead to a much more efficient query plan.
In the other case, when datetime_threshold has to be computed in another SELECT first, Postgres has to default to a generic plan. datetime_threshold could be anything.
The difference will become obvious in EXPLAIN output.
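To see the difference yourself, compare the plans for both variants. The user_id, distance cutoff, and the literal timestamp below are placeholders taken from or assumed for the question; substitute your own values:

```sql
-- Generic plan: threshold computed in a subquery
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > (SELECT max(datetime)
                   FROM   locations
                   WHERE  distance > 70
                   AND    user_id = 9087);

-- Custom plan: threshold supplied as a constant (sample value)
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > '2020-01-01 00:00:00';
```

With the constant, the planner can use the column statistics on datetime to estimate how many rows qualify and pick an index scan accordingly.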
To make sure Postgres optimizes the second part for the actual datetime_threshold value, you can either run two separate queries (feed the result of query 1 as a constant to query 2), or use dynamic SQL to force re-planning of query 2 every time in a PL/pgSQL function.
CREATE OR REPLACE FUNCTION foo(_user_id int, _distance int = 70)
RETURNS SETOF locations
LANGUAGE plpgsql AS
$func$
BEGIN
RETURN QUERY EXECUTE
'SELECT *
FROM locations
WHERE user_id = $1
AND datetime > $2'
USING _user_id
, (SELECT max(datetime)
FROM locations
WHERE distance > _distance
AND user_id = _user_id);
END
$func$;
Call:
SELECT * FROM foo(9087);
In extreme cases, you might even use another dynamic query to calculate datetime_threshold. But I don't expect that's necessary.
As for "something useful in the docs":
[...] The important difference is that EXECUTE will re-plan the command on each execution, generating a plan that is specific to the current parameter values; whereas PL/pgSQL may otherwise create a generic plan and cache it for re-use. In situations where the best plan depends strongly on the parameter values, it can be helpful to use EXECUTE to positively ensure that a generic plan is not selected.
Bold emphasis mine.
Perfect indexes would be:
CREATE INDEX ON locations (user_id, distance DESC NULLS LAST, datetime DESC NULLS LAST); -- for query 1
CREATE INDEX ON locations (user_id, datetime); -- for query 2
Fine-tuning depends on undisclosed details. A partial index might be an option.
There may be any number of additional reasons why your query is slow; there are not enough details to tell.
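For example, if the distance > 70 filter in query 1 is both selective and fixed, the mentioned partial index might look like this (an assumption based on the filter values in the question; the index name is made up):

```sql
CREATE INDEX locations_far_idx ON locations (user_id, datetime DESC NULLS LAST)
WHERE  distance > 70;
```

Such an index only covers rows matching the WHERE clause, so it stays small and serves the max(datetime) lookup in query 1 directly, but it only helps if queries repeat the same distance > 70 condition.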