Tags: sql, postgresql, performance, postgresql-performance

WHERE clause is slower with value from CTE than with constant?


I want to have a variable cached during a query running on Postgres 12. I followed the approach of using a CTE, like below:

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars)
-- END PART 2

Running the above query leads to performance issues. I expected the total runtime to be roughly (part 1 runtime + part 2 runtime), but it takes a lot longer.

Notably, there is no performance issue when I run only the 2nd part with a manually supplied datetime_threshold constant.

The locations table is defined as:

 id | user_id | datetime | location | distance | ...
-----------------------------------------------------
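
Based on that column list, the table might look something like this (the types are assumptions; the question does not give them):

    CREATE TABLE locations (
       id       bigint PRIMARY KEY
     , user_id  int
     , datetime timestamp  -- type assumed; could be timestamptz
     , location text
     , distance int
       -- ...
    );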

Is there any way to reduce the total runtime to something like (part1 runtime + part2 runtime)?


Solution

  • The explanation behind the difference you observed is this:

    Postgres has column statistics and can adapt the query plan depending on the value of a provided constant for datetime_threshold. With favorable filter values, this can lead to a much more efficient query plan.

    In the other case, when datetime_threshold has to be computed in another SELECT first, Postgres has to default to a generic plan. datetime_threshold could be anything.

    The difference will become obvious in EXPLAIN output.
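
For instance, comparing both variants side by side (the timestamp below is a placeholder, not a value from the question):

    EXPLAIN (ANALYZE, BUFFERS)
    WITH cached_vars AS (
       SELECT max(datetime) AS datetime_threshold
       FROM   locations
       WHERE  distance > 70
       AND    user_id = 9087
    )
    SELECT *
    FROM   locations
    WHERE  user_id = 9087
    AND    datetime > (SELECT datetime_threshold FROM cached_vars);

    -- vs. the constant variant:
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM   locations
    WHERE  user_id = 9087
    AND    datetime > '2024-01-01 00:00';  -- placeholder constant

In the first plan the outer filter is estimated against an unknown comparison value; in the second, Postgres can use column statistics for the actual constant.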

    To make sure Postgres optimizes the second part for the actual datetime_threshold value, you can either run two separate queries (feed the result of query 1 as constant to query 2), or use dynamic SQL to force re-planning of query 2 every time in a PL/pgSQL function.
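
The two-query route can be a sketch as simple as this (run query 1, then paste its result into query 2; the timestamp is a placeholder):

    -- Query 1: compute the threshold
    SELECT max(datetime) AS datetime_threshold
    FROM   locations
    WHERE  distance > 70
    AND    user_id = 9087;

    -- Query 2: use the returned value as a constant
    SELECT *
    FROM   locations
    WHERE  user_id = 9087
    AND    datetime > '2024-01-01 00:00';  -- value returned by query 1

In psql, \gset can capture the result of query 1 into a variable for interpolation into query 2.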

For example, forcing re-planning with dynamic SQL in a PL/pgSQL function:

    CREATE OR REPLACE FUNCTION foo(_user_id int, _distance int = 70)
      RETURNS SETOF locations
      LANGUAGE plpgsql AS
    $func$
    BEGIN
       RETURN QUERY EXECUTE 
         'SELECT *
          FROM   locations
          WHERE  user_id = $1
          AND    datetime > $2'
       USING _user_id
          , (SELECT max(datetime)
             FROM   locations
             WHERE  distance > _distance
             AND    user_id = _user_id);
    END
    $func$;
    

    Call:

    SELECT * FROM foo(9087);
    


    In extreme cases, you might even use another dynamic query to calculate datetime_threshold. But I don't expect that's necessary.
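
Such a variant might look like this (a sketch only; the first EXECUTE forces a fresh plan for the threshold query as well):

    CREATE OR REPLACE FUNCTION foo_dynamic(_user_id int, _distance int = 70)
      RETURNS SETOF locations
      LANGUAGE plpgsql AS
    $func$
    DECLARE
       _threshold timestamp;  -- type assumed to match locations.datetime
    BEGIN
       -- compute the threshold with dynamic SQL, too
       EXECUTE 'SELECT max(datetime) FROM locations
                WHERE  distance > $1 AND user_id = $2'
       INTO   _threshold
       USING  _distance, _user_id;

       RETURN QUERY EXECUTE
         'SELECT *
          FROM   locations
          WHERE  user_id = $1
          AND    datetime > $2'
       USING _user_id, _threshold;
    END
    $func$;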

    As for "something useful in the docs":

    [...] The important difference is that EXECUTE will re-plan the command on each execution, generating a plan that is specific to the current parameter values; whereas PL/pgSQL may otherwise create a generic plan and cache it for re-use. In situations where the best plan depends strongly on the parameter values, it can be helpful to use EXECUTE to positively ensure that a generic plan is not selected.

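
Since the question mentions Postgres 12: that version also added the plan_cache_mode setting, which can disable generic plans for a whole session (affecting prepared statements and cached PL/pgSQL statements alike). A sketch:

    SET plan_cache_mode = force_custom_plan;  -- re-plan with actual parameter values

This is a blunter, session-wide instrument; dynamic SQL as shown above is the more targeted fix.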

    Indexes

    Perfect indexes would be:

CREATE INDEX ON locations (user_id, distance DESC NULLS LAST, datetime DESC NULLS LAST); -- for query 1
CREATE INDEX ON locations (user_id, datetime);             -- for query 2
    

Fine-tuning depends on undisclosed details. A partial index might be an option.
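
For instance, if query 1 always filters on the same distance cutoff, a partial index could serve it (the cutoff 70 is taken from the question; adjust it to your actual predicate):

    CREATE INDEX locations_user_datetime_dist70_idx
    ON locations (user_id, datetime DESC NULLS LAST)
    WHERE distance > 70;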

There may be any number of additional reasons why your query is slow; the question does not disclose enough details to say more.