Search code examples
sqlpostgresqlpostgisexplainpostgresql-performance

Reuse computed select value


I'm trying to use ST_SnapToGrid and then GROUP BY the grid cells (x, y). Here is what I did first:

SELECT
  COUNT(*)                      AS n,
  ST_X(ST_SnapToGrid(geom, 50)) AS x,
  ST_Y(ST_SnapToGrid(geom, 50)) AS y
FROM points
GROUP BY x, y

I don't want to recompute ST_SnapToGrid for both x and y. So I changed it to use a sub-query:

SELECT
  COUNT(*)   AS n,
  ST_X(geom) AS x,
  ST_Y(geom) AS y
FROM (
  SELECT 
      ST_SnapToGrid(geom, 50) AS geom
  FROM points
) AS tmp
GROUP BY x, y

But when I run EXPLAIN, both of these queries have the exact same execution plan:

GroupAggregate  (...)
  ->  Sort  (...)
        Sort Key: (st_x(st_snaptogrid(points.geom, 0::double precision))), (st_y(st_snaptogrid(points.geom, 0::double precision)))
        ->  Seq Scan on points  (...)

Question: Will PostgreSQL reuse the result value of ST_SnapToGrid()?
If not, is there a way to make it do this?


Solution

  • Test timing

    You don't see the evaluation of individual functions per row in the EXPLAIN output.

    Test with EXPLAIN ANALYZE to get actual query times to compare overall effectiveness. Run a couple of times to rule out caching artifacts. For simple queries like this, you get more reliable numbers for the total runtime with:

    EXPLAIN (ANALYZE, TIMING OFF) SELECT ...
    

    Requires Postgres 9.2+. Per documentation:

    TIMING

    Include actual startup time and time spent in each node in the output. The overhead of repeatedly reading the system clock can slow down the query significantly on some systems, so it may be useful to set this parameter to FALSE when only actual row counts, and not exact times, are needed. Run time of the entire statement is always measured, even when node-level timing is turned off with this option. This parameter may only be used when ANALYZE is also enabled. It defaults to TRUE.

    Prevent repeated evaluation

    Normally, expressions in a subquery are evaluated once. But Postgres can collapse trivial subqueries if it thinks that will be faster.

    To introduce an optimization barrier, you could use a CTE instead of the subquery. This guarantees that Postgres computes ST_SnapToGrid(geom, 50) once only:

    WITH cte AS (
       SELECT ST_SnapToGrid(geom, 50) AS geom1
       FROM   points
       )
    SELECT COUNT(*)   AS n
         , ST_X(geom1) AS x
         , ST_Y(geom1) AS y
    FROM   cte
    GROUP  BY geom1;         -- see below
    

    However, this it's probably slower than a subquery due to more overhead for a CTE. The function call is probably very cheap. Generally, Postgres knows better how to optimize a query plan. Only introduce such an optimization barrier if you know better.

    Simplify

    I changed the name of the computed point in the subquery / CTE to geom1 to clarify it's different from the original geom. That helps to clarify the more important thing here:

    GROUP BY geom1
    

    instead of:

    GROUP BY x, y

    That's obviously cheaper - and may have an influence on whether the function call is repeated. So, this is probably fastest:

    SELECT COUNT(*) AS n
         , ST_X(ST_SnapToGrid(geom, 50)) AS x
         , ST_y(ST_SnapToGrid(geom, 50)) AS y
    FROM   points
    GROUP  BY ST_SnapToGrid(geom, 50);         -- same here!
    

    Or maybe this:

    SELECT COUNT(*)    AS n
         , ST_X(geom1) AS x
         , ST_y(geom1) AS y
    FROM (
       SELECT ST_SnapToGrid(geom, 50) AS geom1
       FROM   points
       ) AS tmp
    GROUP  BY geom1;
    

    Test all three with EXPLAIN ANALYZE or EXPLAIN (ANALYZE, TIMING OFF) and see for yourself. Testing >> guessing.