Tags: sql, postgresql, performance, postgresql-performance

WHERE clause is slower with value from CTE than with constant?


I want to have a variable cached during a query running on Postgres 12. I followed the approach of using a CTE, like below:

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars)
-- END PART 2

Running the above query leads to performance issues. I expected the total runtime to be roughly (part 1 runtime + part 2 runtime), but it takes a lot longer.

Notably, there is no performance issue when I run only the 2nd part with a manually supplied datetime_threshold constant.

The locations table is defined as:

 id | user_id | datetime | location | distance | ...
-----------------------------------------------------
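
Based on that column list, the table might look something like this (the types are assumptions; the question does not give them):

    CREATE TABLE locations (
       id       bigint PRIMARY KEY
     , user_id  int
     , datetime timestamp  -- type assumed; could be timestamptz
     , location text
     , distance int
       -- ...
    );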

Is there any way to reduce the total runtime to something like (part1 runtime + part2 runtime)?


Solution

  • The explanation behind the difference you observed is this:

    Postgres has column statistics and can adapt the query plan depending on the value of a provided constant for datetime_threshold. With favorable filter values, this can lead to a much more efficient query plan.

    In the other case, when datetime_threshold has to be computed in another SELECT first, Postgres has to default to a generic plan. datetime_threshold could be anything.

    The difference will become obvious in EXPLAIN output.
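
For instance, comparing both variants side by side (the timestamp below is a placeholder, not a value from the question):

    EXPLAIN (ANALYZE, BUFFERS)
    WITH cached_vars AS (
       SELECT max(datetime) AS datetime_threshold
       FROM   locations
       WHERE  distance > 70
       AND    user_id = 9087
    )
    SELECT *
    FROM   locations
    WHERE  user_id = 9087
    AND    datetime > (SELECT datetime_threshold FROM cached_vars);

    -- vs. the constant variant:
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM   locations
    WHERE  user_id = 9087
    AND    datetime > '2024-01-01 00:00';  -- placeholder constant

In the first plan the outer filter is estimated against an unknown comparison value; in the second, Postgres can use column statistics for the actual constant.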

    To make sure Postgres optimizes the second part for the actual datetime_threshold value, you can either run two separate queries (feed the result of query 1 as constant to query 2), or use dynamic SQL to force re-planning of query 2 every time in a PL/pgSQL function.
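
The two-query route can be a sketch as simple as this (run query 1, then paste its result into query 2; the timestamp is a placeholder):

    -- Query 1: compute the threshold
    SELECT max(datetime) AS datetime_threshold
    FROM   locations
    WHERE  distance > 70
    AND    user_id = 9087;

    -- Query 2: use the returned value as a constant
    SELECT *
    FROM   locations
    WHERE  user_id = 9087
    AND    datetime > '2024-01-01 00:00';  -- value returned by query 1

In psql, \gset can capture the result of query 1 into a variable for interpolation into query 2.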

For example, forcing re-planning with dynamic SQL in a PL/pgSQL function:

    CREATE OR REPLACE FUNCTION foo(_user_id int, _distance int = 70)
      RETURNS SETOF locations
      LANGUAGE plpgsql AS
    $func$
    BEGIN
       RETURN QUERY EXECUTE 
         'SELECT *
          FROM   locations
          WHERE  user_id = $1
          AND    datetime > $2'
       USING _user_id
          , (SELECT max(datetime)
             FROM   locations
             WHERE  distance > _distance
             AND    user_id = _user_id);
    END
    $func$;
    

    Call:

    SELECT * FROM foo(9087);
    


    In extreme cases, you might even use another dynamic query to calculate datetime_threshold. But I don't expect that's necessary.
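
Such a variant might look like this (a sketch only; the first EXECUTE forces a fresh plan for the threshold query as well):

    CREATE OR REPLACE FUNCTION foo_dynamic(_user_id int, _distance int = 70)
      RETURNS SETOF locations
      LANGUAGE plpgsql AS
    $func$
    DECLARE
       _threshold timestamp;  -- type assumed to match locations.datetime
    BEGIN
       -- compute the threshold with dynamic SQL, too
       EXECUTE 'SELECT max(datetime) FROM locations
                WHERE  distance > $1 AND user_id = $2'
       INTO   _threshold
       USING  _distance, _user_id;

       RETURN QUERY EXECUTE
         'SELECT *
          FROM   locations
          WHERE  user_id = $1
          AND    datetime > $2'
       USING _user_id, _threshold;
    END
    $func$;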

    As for "something useful in the docs":

    [...] The important difference is that EXECUTE will re-plan the command on each execution, generating a plan that is specific to the current parameter values; whereas PL/pgSQL may otherwise create a generic plan and cache it for re-use. In situations where the best plan depends strongly on the parameter values, it can be helpful to use EXECUTE to positively ensure that a generic plan is not selected.

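
Since the question mentions Postgres 12: that version also added the plan_cache_mode setting, which can disable generic plans for a whole session (affecting prepared statements and cached PL/pgSQL statements alike). A sketch:

    SET plan_cache_mode = force_custom_plan;  -- re-plan with actual parameter values

This is a blunter, session-wide instrument; dynamic SQL as shown above is the more targeted fix.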

    Indexes

    Perfect indexes would be:

CREATE INDEX ON locations (user_id, distance DESC NULLS LAST, datetime DESC NULLS LAST); -- for query 1
CREATE INDEX ON locations (user_id, datetime);             -- for query 2
    

Fine-tuning depends on undisclosed details. A partial index might be an option.
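
For instance, if query 1 always filters on the same distance cutoff, a partial index could serve it (the cutoff 70 is taken from the question; adjust it to your actual predicate):

    CREATE INDEX locations_user_datetime_dist70_idx
    ON locations (user_id, datetime DESC NULLS LAST)
    WHERE distance > 70;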

There may be any number of additional reasons why your query is slow; the question does not disclose enough details to say more.