Slow join query on large table using Postgres

I have a growing table called uploads that currently has around 10 million rows.

Table "public.uploads" (10 million rows)

  Column   |           Type           | Collation | Nullable | Default | Storage  | Stats target | Description
-----------+--------------------------+-----------+----------+---------+----------+--------------+-------------
 postal    | character varying(255)   |           | not null |         | extended |              |
 id        | character varying(255)   |           | not null |         | extended |              |
 pk        | character varying(255)   |           | not null |         | extended |              |
 createdAt | timestamp with time zone |           | not null |         | plain    |              |
 updatedAt | timestamp with time zone |           | not null |         | plain    |              |
Indexes:
    "uploads_pkey" PRIMARY KEY, btree (pk)
    "uploads_id" btree (id)
    "uploads_postal" btree (postal)
Access method: heap

I have another table that contains 8307 rows. The number of rows on this table will not change.

Table "public.germanypostcodelookups" (8307 rows)

     Column     |          Type          | Collation | Nullable |                      Default                       | Storage  | Stats target | Description
----------------+------------------------+-----------+----------+----------------------------------------------------+----------+--------------+-------------
 id             | integer                |           | not null | nextval('germanypostcodelookups_id_seq'::regclass) | plain    |              |
 pcd            | character varying(255) |           | not null |                                                    | extended |              |
 place          | character varying(255) |           | not null |                                                    | extended |              |
 state          | character varying(255) |           | not null |                                                    | extended |              |
 state_code     | character varying(255) |           | not null |                                                    | extended |              |
 community      | character varying(255) |           | not null |                                                    | extended |              |
 community_code | character varying(255) |           | not null |                                                    | extended |              |
 lat            | character varying(255) |           | not null |                                                    | extended |              |
 long           | character varying(255) |           | not null |                                                    | extended |              |
Indexes:
    "germanypostcodelookups_pkey" PRIMARY KEY, btree (id)
    "germanypostcodelookups_pcd" UNIQUE, btree (pcd)
Access method: heap

I have another table that contains around 2 million rows. The number of rows on this table will not change.

Table "public.postcodelookups" (2 million rows)

  Column   |          Type          | Collation | Nullable |                   Default                   | Storage  | Stats target | Description
-----------+------------------------+-----------+----------+---------------------------------------------+----------+--------------+-------------
 id        | integer                |           | not null | nextval('postcodelookups_id_seq'::regclass) | plain    |              |
 pcd       | character varying(255) |           | not null |                                             | extended |              |
 pcd2      | character varying(255) |           | not null |                                             | extended |              |
 pcds      | character varying(255) |           | not null |                                             | extended |              |
 dointr    | character varying(255) |           | not null |                                             | extended |              |
 doterm    | character varying(255) |           | not null |                                             | extended |              |
 oscty     | character varying(255) |           | not null |                                             | extended |              |
 ced       | character varying(255) |           | not null |                                             | extended |              |
 oslaua    | character varying(255) |           | not null |                                             | extended |              |
 osward    | character varying(255) |           | not null |                                             | extended |              |
 parish    | character varying(255) |           | not null |                                             | extended |              |
 usertype  | character varying(255) |           | not null |                                             | extended |              |
 oseast1m  | character varying(255) |           | not null |                                             | extended |              |
 osnrth1m  | character varying(255) |           | not null |                                             | extended |              |
 osgrdind  | character varying(255) |           | not null |                                             | extended |              |
 oshlthau  | character varying(255) |           | not null |                                             | extended |              |
 nhser     | character varying(255) |           | not null |                                             | extended |              |
 ctry      | character varying(255) |           | not null |                                             | extended |              |
 rgn       | character varying(255) |           | not null |                                             | extended |              |
 streg     | character varying(255) |           | not null |                                             | extended |              |
 pcon      | character varying(255) |           | not null |                                             | extended |              |
 eer       | character varying(255) |           | not null |                                             | extended |              |
 teclec    | character varying(255) |           | not null |                                             | extended |              |
 ttwa      | character varying(255) |           | not null |                                             | extended |              |
 pct       | character varying(255) |           | not null |                                             | extended |              |
 itl       | character varying(255) |           | not null |                                             | extended |              |
 statsward | character varying(255) |           | not null |                                             | extended |              |
 oa01      | character varying(255) |           | not null |                                             | extended |              |
 casward   | character varying(255) |           | not null |                                             | extended |              |
 npark     | character varying(255) |           | not null |                                             | extended |              |
 lsoa01    | character varying(255) |           | not null |                                             | extended |              |
 msoa01    | character varying(255) |           | not null |                                             | extended |              |
 ur01ind   | character varying(255) |           | not null |                                             | extended |              |
 oac01     | character varying(255) |           | not null |                                             | extended |              |
 oa11      | character varying(255) |           | not null |                                             | extended |              |
 lsoa11    | character varying(255) |           | not null |                                             | extended |              |
 msoa11    | character varying(255) |           | not null |                                             | extended |              |
 wz11      | character varying(255) |           | not null |                                             | extended |              |
 sicbl     | character varying(255) |           | not null |                                             | extended |              |
 bua11     | character varying(255) |           | not null |                                             | extended |              |
 buasd11   | character varying(255) |           | not null |                                             | extended |              |
 ru11ind   | character varying(255) |           | not null |                                             | extended |              |
 oac11     | character varying(255) |           | not null |                                             | extended |              |
 lat       | character varying(255) |           | not null |                                             | extended |              |
 long      | character varying(255) |           | not null |                                             | extended |              |
 lep1      | character varying(255) |           | not null |                                             | extended |              |
 lep2      | character varying(255) |           | not null |                                             | extended |              |
 pfa       | character varying(255) |           | not null |                                             | extended |              |
 imd       | character varying(255) |           | not null |                                             | extended |              |
 calncv    | character varying(255) |           | not null |                                             | extended |              |
 icb       | character varying(255) |           | not null |                                             | extended |              |
 oa21      | character varying(255) |           | not null |                                             | extended |              |
 lsoa21    | character varying(255) |           | not null |                                             | extended |              |
 msoa21    | character varying(255) |           | not null |                                             | extended |              |

Indexes:
        "postcodelookups_pkey" PRIMARY KEY, btree (id)
        "postcodelookups_pcd" UNIQUE, btree (pcd)
    Access method: heap

Running the following is extremely slow:

EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) 
SELECT b.state_code, COUNT(b.state_code) 
FROM germanypostcodelookups b 
INNER JOIN public.uploads u ON b.pcd = u.postal 
WHERE u.id = '6ba5321b-5a40-469b-8c21-0ee9fa5f3f14' 
GROUP BY b.state_code;

With the output being:

"GroupAggregate  (cost=8416.31..8417.17 rows=16 width=11) (actual time=187268.518..187268.526 rows=4 loops=1)"
"  Output: b.state_code, count(b.state_code)"
"  Group Key: b.state_code"
"  Buffers: shared hit=4812502 read=3040967"
"  ->  Sort  (cost=8416.31..8416.54 rows=93 width=3) (actual time=187268.495..187268.497 rows=14 loops=1)"
"        Output: b.state_code"
"        Sort Key: b.state_code"
"        Sort Method: quicksort  Memory: 25kB"
"        Buffers: shared hit=4812502 read=3040967"
"        ->  Merge Join  (cost=90.19..8413.27 rows=93 width=3) (actual time=9.706..187268.448 rows=14 loops=1)"
"              Output: b.state_code"
"              Merge Cond: ((b.pcd)::text = (u.postal)::text)"
"              Buffers: shared hit=4812502 read=3040967"
"              ->  Index Scan using germanypostcodelookups_pcd on public.germanypostcodelookups b  (cost=0.29..643.09 rows=8307 width=9) (actual time=0.058..4.054 rows=7575 loops=1)"
"                    Output: b.id, b.pcd, b.place, b.state, b.state_code, b.community, b.community_code, b.lat, b.long"
"                    Buffers: shared hit=332 read=133"
"              ->  Index Scan using uploads_postal on public.uploads u  (cost=0.43..876932.81 rows=2490 width=7) (actual time=7.476..187259.462 rows=14 loops=1)"
"                    Output: u.postal, u.id, u.pk, u.""createdAt"", u.""updatedAt"""
"                    Filter: ((u.id)::text = '6ba5321b-5a40-469b-8c21-0ee9fa5f3f14'::text)"
"                    Rows Removed by Filter: 9825393"
"                    Buffers: shared hit=4812170 read=3040834"
"Planning:"
"  Buffers: shared hit=4 read=7"
"Planning Time: 1.835 ms"
"Execution Time: 187268.927 ms"

However running the following is very fast:

EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) 
SELECT b.oscty, COUNT(b.oscty) 
FROM postcodelookups b 
INNER JOIN public.uploads u ON b.pcd = u.postal 
WHERE u.id = '0fac6b33-e182-49e5-9e7d-dd7bbcc66d98' 
GROUP BY b.oscty;

With the output being:

"Finalize GroupAggregate  (cost=22070.84..22085.68 rows=28 width=17) (actual time=75.689..83.014 rows=24 loops=1)"
"  Output: b.oscty, count(b.oscty)"
"  Group Key: b.oscty"
"  Buffers: shared hit=2023"
"  ->  Gather Merge  (cost=22070.84..22085.26 rows=28 width=17) (actual time=75.665..82.980 rows=24 loops=1)"
"        Output: b.oscty, (PARTIAL count(b.oscty))"
"        Workers Planned: 1"
"        Workers Launched: 1"
"        Buffers: shared hit=2023"
"        ->  Partial GroupAggregate  (cost=21070.83..21082.10 rows=28 width=17) (actual time=8.364..8.470 rows=12 loops=2)"
"              Output: b.oscty, PARTIAL count(b.oscty)"
"              Group Key: b.oscty"
"              Buffers: shared hit=2023"
"              Worker 0:  actual time=0.193..0.194 rows=0 loops=1"
"                Buffers: shared hit=11"
"              ->  Sort  (cost=21070.83..21074.50 rows=1465 width=9) (actual time=8.358..8.384 rows=250 loops=2)"
"                    Output: b.oscty"
"                    Sort Key: b.oscty"
"                    Sort Method: quicksort  Memory: 48kB"
"                    Buffers: shared hit=2023"
"                    Worker 0:  actual time=0.191..0.192 rows=0 loops=1"
"                      Sort Method: quicksort  Memory: 25kB"
"                      Buffers: shared hit=11"
"                    ->  Nested Loop  (cost=32.16..20993.80 rows=1465 width=9) (actual time=0.139..7.830 rows=250 loops=2)"
"                          Output: b.oscty"
"                          Inner Unique: true"
"                          Buffers: shared hit=2012"
"                          Worker 0:  actual time=0.043..0.043 rows=0 loops=1"
"                          ->  Parallel Bitmap Heap Scan on public.uploads u  (cost=31.73..9013.59 rows=1465 width=7) (actual time=0.107..0.200 rows=250 loops=2)"
"                                Output: u.postal, u.id, u.pk, u.""createdAt"", u.""updatedAt"""
"                                Recheck Cond: ((u.id)::text = '0fac6b33-e182-49e5-9e7d-dd7bbcc66d98'::text)"
"                                Heap Blocks: exact=9"
"                                Buffers: shared hit=12"
"                                Worker 0:  actual time=0.042..0.042 rows=0 loops=1"
"                                ->  Bitmap Index Scan on uploads_id  (cost=0.00..31.11 rows=2490 width=0) (actual time=0.144..0.144 rows=500 loops=1)"
"                                      Index Cond: ((u.id)::text = '0fac6b33-e182-49e5-9e7d-dd7bbcc66d98'::text)"
"                                      Buffers: shared hit=3"
"                          ->  Index Scan using postcodelookups_pcd on public.postcodelookups b  (cost=0.43..8.18 rows=1 width=16) (actual time=0.029..0.029 rows=1 loops=500)"
"                                Output: b.id, b.pcd, b.pcd2, b.pcds, b.dointr, b.doterm, b.oscty, b.ced, b.oslaua, b.osward, b.parish, b.usertype, b.oseast1m, b.osnrth1m, b.osgrdind, b.oshlthau, b.nhser, b.ctry, b.rgn, b.streg, b.pcon, b.eer, b.teclec, b.ttwa, b.pct, b.itl, b.statsward, b.oa01, b.casward, b.npark, b.lsoa01, b.msoa01, b.ur01ind, b.oac01, b.oa11, b.lsoa11, b.msoa11, b.wz11, b.sicbl, b.bua11, b.buasd11, b.ru11ind, b.oac11, b.lat, b.long, b.lep1, b.lep2, b.pfa, b.imd, b.calncv, b.icb, b.oa21, b.lsoa21, b.msoa21"
"                                Index Cond: ((b.pcd)::text = (u.postal)::text)"
"                                Buffers: shared hit=2000"
"Planning:"
"  Buffers: shared hit=156 read=1"
"Planning Time: 12.088 ms"
"Execution Time: 83.249 ms"

My question is - why is the first query so much slower than the second despite it having to aggregate across a smaller number of "uploads"? And how can I optimise it?

Solution

With a merge join, once one child node runs out of values, the other child node can be terminated early.

Looking at the cost involved, clearly this is what the planner is expecting to happen. With the merge join cost of 8413.27 but one child cost of 876932.81, it expects to abandon the child after executing less than 1/100 of it. But this expectation is obviously not fulfilled and it has to read essentially the entire index.

In other words, it thinks that over 99% of the postal codes in uploads will be after the highest value ever seen for pcd in germanypostcodelookups, while that is not actually the case.

The logic of how this merge-join estimation is made have changed over the years, so without knowing the version it is hard to guess what has gone wrong. But an index on (id, postal) will provide a much faster alternative way to run a merge join. So if you can't get it to avoid the merge join, at least you should be able to make that merge join hugely faster.