sql postgresql performance common-table-expression recursive-cte

Why does a simple recursive CTE make PostgreSQL resort to an external sort?

Using PostgreSQL 14, I'm working on a table workplaces which models a tree of workplaces. Besides attributes like country_code, each workplace identifies its parent workplace via a parent_id foreign key (which is NULL in case there is no parent):

CREATE TABLE workplaces(
  id SERIAL PRIMARY KEY,
  parent_id INTEGER REFERENCES(workplaces.id),
  ..
)

I'd now like to identify groups of workplaces in this table by picking a couple of interesting rows and then identifying all descendants (i.e. children, grand-children etc.) of that. My thinking was to build a table which maps workplace IDs to their 'root ancestor'. A recursive CTE does the job:

WITH RECURSIVE

"ancestors" AS (
  SELECT
    id AS id,
    id AS ancestor_or_self_id
  FROM
    workplaces
  WHERE country_code = 'DE' AND type = 'clinic' AND duplicate_of IS NULL

  UNION ALL (

  SELECT
    w.id AS id,
    a.ancestor_or_self_id AS ancestor_or_self_id

  FROM workplaces AS w
  JOIN ancestors AS a ON w.parent_id = a.id
  )
)
SELECT COUNT(*) FROM ancestors;
;

What surprised me is that this takes surprisingly long (about a second on my laptop). A few numbers to put this into perspective:

The workplaces table has 295k rows
The base step of the recursion yields 3758 rows.
The entire query produces 35431 rows.
The recursion is at most 4 levels deep at this point.

This is the query plan:

                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=789384.97..789384.98 rows=1 width=8) (actual time=795.210..795.246 rows=1 loops=1)
   CTE ancestors
     ->  Recursive Union  (cost=1000.00..723164.45 rows=2943134 width=16) (actual time=41.765..788.647 rows=35431 loops=1)
           ->  Gather  (cost=1000.00..23501.17 rows=3794 width=16) (actual time=41.764..54.481 rows=3758 loops=1)
                 Workers Planned: 2
                 Workers Launched: 2
                 ->  Parallel Seq Scan on workplaces  (cost=0.00..22121.77 rows=1581 width=16) (actual time=37.064..49.939 rows=1253 loops=3)
                       Filter: ((duplicate_of IS NULL) AND ((country_code)::text = 'DE'::text) AND ((type)::text = 'clinic'::text))
                       Rows Removed by Filter: 97135
           ->  Merge Join  (cost=58737.30..64080.06 rows=293934 width=16) (actual time=112.727..119.909 rows=5279 loops=6)
                 Merge Cond: (a.id = w.parent_id)
                 ->  Sort  (cost=3644.41..3739.26 rows=37940 width=16) (actual time=1.360..1.613 rows=5905 loops=6)
                       Sort Key: a.id
                       Sort Method: quicksort  Memory: 25kB
                       ->  WorkTable Scan on ancestors a  (cost=0.00..758.80 rows=37940 width=16) (actual time=0.001..0.344 rows=5905 loops=6)
                 ->  Materialize  (cost=55092.89..56568.70 rows=295163 width=16) (actual time=109.127..115.263 rows=43088 loops=6)
                       ->  Sort  (cost=55092.89..55830.80 rows=295163 width=16) (actual time=109.126..111.831 rows=43088 loops=6)
                             Sort Key: w.parent_id
                             Sort Method: external merge  Disk: 5696kB
                             ->  Seq Scan on workplaces w  (cost=0.00..23228.63 rows=295163 width=16) (actual time=0.007..67.569 rows=295163 loops=6)
   ->  CTE Scan on ancestors  (cost=0.00..58862.68 rows=2943134 width=0) (actual time=41.769..793.929 rows=35431 loops=1)
 Planning Time: 0.377 ms
 Execution Time: 797.738 ms

My reading of this is that most of the time is spent in sorting rows as part of the recursive step (as part of materializing the CTE): it appears PostgreSQL decides to not do this in memory but rather does a merge sort on disk. Why is that though -- shouldn't sorting a table with just two integers and row counts like what I'm dealing with easily fit into memory?

There is just one relevant index on the workplaces table, on the id column (by virtue of it being the primary key). I tried also setting an index on the parent_id column, but that didn't seem to help with anything.

The work_mem setting is documented to affect sorting: it is set to just 4MB on this instance. Bumping it to 32MB helps quite a bit -- it avoids the external disk merge and instead makes PostgreSQL do a quicksort in memory.

Is there a way to accelerate the query without throwing more memory at it? It puzzles me that it's ever trying to sort all 295k workplace rows - my assumption is that it would merely need to sort each recursive step in order to perform a merge sort (and each step yields a much smaller number of rows than 295k).

Solution

Without access to the data, I just have to guess what could work for indexing.

CREATE INDEX idx_workplaces_country_type 
ON workplaces(country_code, type) -- maybe first type and then country_code
WHERE duplicate_of IS NULL;

And definitely this one:

CREATE INDEX idx_workplaces_parent_id 
ON workplaces(parent_id);

Please share the new query plan so we can see that something changed/improved;

You could also include the id in the index. This could enable an index-only scan:

CREATE INDEX idx_workplaces_parent_id_include_id 
ON workplaces(parent_id) INCLUDE(id);

random_page_cost should represent the performance of random IO compared to sequential IO. The default value 4 used to work well for rotating disks, but for SSD you should pick a lower value. Something close to 1 is a good starting point, like 1.1