postgresql performance query-optimization common-table-expression

Postgresql poorly planned query runs too long

I have a complex query which has been greatly simplified below, running on "PostgreSQL 11.9 on aarch64-unknown-linux-gnu, compiled by aarch64-unknown-linux-gnu-gcc (GCC) 7.4.0, 64-bit", running on an AWS Aurora Serverless 2xlarge server (8 cores, 64GB RAM).

I have the following...

mv_journey, a materialized view with ~550M rows which contains information about journeys which have an origin and destination, and some measures about those (how long the journey took, etc), defined with columns from_id and from_region which identify the origins, and to_id and to_region for the destinations.

place_from and place_to, which are calculated from a function, fn_location_get, in an initial step of a CTE, and contain id and region (which map to the from_id, from_region, and to_id, to_region, respectively). These also contain rollup levels from the region, eg country, continent. Typically these return between ~100 and 20,000 rows.

Later in that CTE, I use place_from and place_to to filter the 550M mv_journey rows, and group by to create a rollup report based on journeys, eg from country to country.

The simplified query is something like this.

WITH place_from AS (

  select * 
  from fn_location_get(...)

), place_to AS (

  select * 
  from fn_location_get(...)

)

select [many dimension columns...]
  , [a few aggregated measure columns]
from mv_journey j
  inner join place_from o on j.from_id = o.id
    and j.from_region = o.region
  inner join place_to d on j.from_id = d.id
    and j.from_region = d.region
where service_type_id = ?
group by [many dimension columns...]

I have indexes on mv_journey

CREATE INDEX idx_mv_journey_from ON mv_journey (from_id, from_region);
CREATE INDEX idx_mv_journey_to ON mv_journey (to_id, to_region);

When I run the query (using SET LOCAL work_mem = '2048MB' to invoke quicksorts) with a small number of rows in the place_from (92) and a large number in place_to (~18,000), the query runs in about 25 seconds with the following query plan (which includes the steps in the CTE to generate place_from and place_to).

"GroupAggregate  (cost=530108.64..530129.64 rows=30 width=686) (actual time=13097.187..25408.707 rows=92 loops=1)"
"  Group Key: [many dimension columns...]"
"  CTE place_from"
"    ->  Function Scan on fn_location_get (cost=0.25..10.25 rows=1000 width=396) (actual time=34.275..34.331 rows=92 loops=1)"
"  CTE place_to"
"    ->  Function Scan on fn_location_get (cost=0.25..10.25 rows=1000 width=396) (actual time=96.287..97.428 rows=18085 loops=1)"
"  ->  Sort  (cost=530088.14..530088.22 rows=30 width=622) (actual time=12935.329..13295.468 rows=1871349 loops=1)"
"        Sort Key: [many dimension columns...]"
"        Sort Method: quicksort  Memory: 826782kB"
"        ->  Merge Join  (cost=529643.68..530087.41 rows=30 width=622) (actual time=4708.780..6021.449 rows=1871349 loops=1)"
"              Merge Cond: ((j.to_id = d.id) AND (j.to_region = d.region))"
"              ->  Sort  (cost=529573.85..529719.16 rows=58124 width=340) (actual time=4583.265..4788.625 rows=1878801 loops=1)"
"                    Sort Key: j.to_id, j.to_region"
"                    Sort Method: quicksort  Memory: 623260kB"
"                    ->  Nested Loop  (cost=0.57..524974.25 rows=58124 width=340) (actual time=34.324..3079.815 rows=1878801 loops=1)"
"                          ->  CTE Scan on place_from o  (cost=0.00..20.00 rows=1000 width=320) (actual time=34.277..34.432 rows=92 loops=1)"
"                          ->  Index Scan using idx_mv_journey_from on mv_journey j (cost=0.57..524.37 rows=58 width=60) (actual time=0.018..30.022 rows=20422 loops=92)"
"                                Index Cond: ((from_id = o.id) AND (from_region = o.region))"
"                                Filter: (service_type_id = 'ALL'::text)"
"                                Rows Removed by Filter: 81687"
"              ->  Sort  (cost=69.83..72.33 rows=1000 width=320) (actual time=125.505..223.780 rows=1871350 loops=1)"
"                    Sort Key: d.id, d.region"
"                    Sort Method: quicksort  Memory: 3329kB"
"                    ->  CTE Scan on place_to d  (cost=0.00..20.00 rows=1000 width=320) (actual time=96.292..103.677 rows=18085 loops=1)"
"Planning Time: 0.546 ms"
"Execution Time: 25501.827 ms"

The problem is that when I swap the locations in the from/to, ie, large number of rows in the place_from (~18,000) and a small number in place_to (92), the query takes forever. By the way, mv_journey is expected to have the same number of rows matched in both cases - there are not more records expected in one direction than the other.

I have not once got this second query to complete without it running for hours and PGAdmin 4 losing the connection to the server. I therefore cannot even do a EXPLAIN ANALYZE on it. However I have the EXPLAIN:

"GroupAggregate  (cost=474135.40..474152.90 rows=25 width=686)"
"  Group Key: [many dimension columns...]"
"  CTE place_from"
"    ->  Function Scan on fn_location_get  (cost=0.25..10.25 rows=1000 width=396)"
"  CTE place_to"
"    ->  Function Scan on fn_location_get (cost=0.25..10.25 rows=1000 width=396)"
"  ->  Sort  (cost=474114.90..474114.96 rows=25 width=622)"
"        Sort Key: [many dimension columns...]"
"        ->  Merge Join  (cost=473720.23..474114.31 rows=25 width=622)"
"              Merge Cond: ((j.to_id = d.id) AND (j.to_region = d.region))## Heading ##"
"              ->  Sort  (cost=473650.40..473779.18 rows=51511 width=340)"
"                    Sort Key: j.to_id, j.to_region"
"                    ->  Nested Loop  (cost=0.57..469619.00 rows=51511 width=340)"
"                          ->  CTE Scan on place_from o  (cost=0.00..20.00 rows=1000 width=320)"
"                          ->  Index Scan using idx_mv_journey_from on mv_journey j   (cost=0.57..469.08 rows=52 width=60)"
"                                Index Cond: ((from_id = o.id) AND (from_region = o.region))"
"                                Filter: (service_type_id = 'ALL'::text)"
"              ->  Sort  (cost=69.83..72.33 rows=1000 width=320)"
"                    Sort Key: d.id, d.region"
"                    ->  CTE Scan on place_to d  (cost=0.00..20.00 rows=1000 width=320)"

My assumption was that if I had the equivalent indexes on both sides of the from/to, then Postgres would use the mirror-opposite query plan, doing a merge join for the origin and a nested loop join using idx_mv_journey_to for the destination.

But it looks like the query planner's row count estimates are way off in both queries. It seems only luck that the first query performs so well despite that.

I have tried the following, none of which worked

swap the inner join statements so the destination join is first
ALTER TABLE mv_journey ALTER COLUMN to_id SET STATISTICS 1000; ANALYZE mv_journey
ALTER TABLE mv_journey ALTER COLUMN from_id SET STATISTICS 1000; ANALYZE mv_journey

I guess the plan is done before the start of CTE execution? And that's why it has no idea what will come out of the fn_location_get calls that create the place_from and place_to sets?

fn_location_get is a complicated function with its own recursive CTE and I don't want to bring its logic out of the function and into this CTE.

What's the best way out of this mess?

Solution

The most straightforward approach is would be to create two temp tables as the result of the function calls, manually ANALYZE them, then run the query against the temp tables rather than the function calls.