Tags: postgresql, postgresql-12

Speed Up Query with Multiple Inner and Outer Joins


I'm having trouble with a slow PostgreSQL query. I've gone through the standard postgresql.conf tuning and verified that the columns referenced in the joins are indexed. Beyond that, I'm not sure what the next step would be. The query below takes just under 3 minutes to run. Any help is appreciated.

select distinct
     exp.assay_id as ASSAY_KEY,
     rest.result_type_id as RESULT_TYPE_ID,
     rest.name as RESULT_TYPE,
     rest.unit as REST_UNIT,
     dtrest.name as REST_DATA_TYPE,
     cont.condition_type_id as COND_TYPE_ID,
     cont.name as COND_TYPE,
     cont.unit as COND_UNIT,
     dtcont.name as COND_DATA_TYPE,
     expcon.unit as EXP_COND_UNIT
from
     public.experiment exp
     inner join public.experiment_result expr on expr.experiment_id = exp.experiment_id
     inner join public.result_type rest on rest.result_type_id = expr.result_type_id
     left outer join public.experiment_condition expcon on expcon.experiment_id = expr.experiment_id
     left outer join public.condition_type cont on cont.condition_type_id = expcon.condition_type_id
     left outer join public.data_type dtcont on dtcont.data_type_id = cont.data_type_id
     left outer join public.data_type dtrest on dtrest.data_type_id = rest.data_type_id
where
     exp.assay_id in (255)

EXPLAIN ANALYZE results:

Unique  (cost=51405438.73..52671302.26 rows=50634541 width=1109) (actual time=123349.423..164779.863 rows=3 loops=1)
  ->  Sort  (cost=51405438.73..51532025.09 rows=50634541 width=1109) (actual time=123349.421..157973.215 rows=29521242 loops=1)
        Sort Key: rest.result_type_id, rest.name, rest.unit, dtrest.name, cont.condition_type_id, cont.name, cont.unit, dtcont.name, expcon.unit
        Sort Method: external merge  Disk: 3081440kB
        ->  Hash Left Join  (cost=56379.88..1743073.05 rows=50634541 width=1109) (actual time=1307.931..26398.626 rows=29521242 loops=1)
              Hash Cond: (rest.data_type_id = dtrest.data_type_id)
              ->  Hash Left Join  (cost=56378.68..1547566.26 rows=50634541 width=799) (actual time=1307.894..21181.787 rows=29521242 loops=1)
                    Hash Cond: (expr.experiment_id = expcon.experiment_id)
                    ->  Hash Join  (cost=5096.61..572059.62 rows=15984826 width=47) (actual time=1002.697..11046.550 rows=9840414 loops=1)
                          Hash Cond: (expr.result_type_id = rest.result_type_id)
                          ->  Hash Join  (cost=5091.86..528637.07 rows=15984826 width=24) (actual time=44.062..7969.272 rows=9840414 loops=1)
                                Hash Cond: (expr.experiment_id = exp.experiment_id)
                                ->  Seq Scan on experiment_result expr  (cost=0.00..462557.70 rows=23232570 width=16) (actual time=0.080..4357.646 rows=23232570 loops=1)
                                ->  Hash  (cost=3986.11..3986.11 rows=88460 width=16) (actual time=43.743..43.744 rows=88135 loops=1)
                                      Buckets: 131072  Batches: 1  Memory Usage: 5156kB
                                      ->  Seq Scan on experiment exp  (cost=0.00..3986.11 rows=88460 width=16) (actual time=0.016..24.426 rows=88135 loops=1)
                                            Filter: (assay_id = 255)
                                            Rows Removed by Filter: 40434
                          ->  Hash  (cost=3.22..3.22 rows=122 width=31) (actual time=958.617..958.618 rows=128 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 17kB
                                ->  Seq Scan on result_type rest  (cost=0.00..3.22 rows=122 width=31) (actual time=958.542..958.575 rows=128 loops=1)
                    ->  Hash  (cost=9509.53..9509.53 rows=382603 width=768) (actual time=294.654..294.658 rows=382553 loops=1)
                          Buckets: 16384  Batches: 32  Memory Usage: 1077kB
                          ->  Hash Left Join  (cost=2.67..9509.53 rows=382603 width=768) (actual time=0.074..176.040 rows=382553 loops=1)
                                Hash Cond: (cont.data_type_id = dtcont.data_type_id)
                                ->  Hash Left Join  (cost=1.47..8301.31 rows=382603 width=458) (actual time=0.048..117.994 rows=382553 loops=1)
                                      Hash Cond: (expcon.condition_type_id = cont.condition_type_id)
                                      ->  Seq Scan on experiment_condition expcon  (cost=0.00..7102.03 rows=382603 width=74) (actual time=0.016..48.704 rows=382553 loops=1)
                                      ->  Hash  (cost=1.21..1.21 rows=21 width=392) (actual time=0.021..0.022 rows=24 loops=1)
                                            Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                            ->  Seq Scan on condition_type cont  (cost=0.00..1.21 rows=21 width=392) (actual time=0.012..0.014 rows=24 loops=1)
                                ->  Hash  (cost=1.09..1.09 rows=9 width=326) (actual time=0.015..0.016 rows=9 loops=1)
                                      Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                      ->  Seq Scan on data_type dtcont  (cost=0.00..1.09 rows=9 width=326) (actual time=0.008..0.010 rows=9 loops=1)
              ->  Hash  (cost=1.09..1.09 rows=9 width=326) (actual time=0.018..0.019 rows=9 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 9kB
                    ->  Seq Scan on data_type dtrest  (cost=0.00..1.09 rows=9 width=326) (actual time=0.012..0.014 rows=9 loops=1)
Planning Time: 5.997 ms
JIT:
  Functions: 55
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 19.084 ms, Inlining 20.283 ms, Optimization 604.666 ms, Emission 332.835 ms, Total 976.868 ms
Execution Time: 165268.155 ms

Solution

  • The query has to wade through roughly 30 million rows produced by the joins, since your condition exp.assay_id in (255) is not very restrictive.

    It just so happens that most of these result rows are identical, so that only three different rows remain after the DISTINCT.

    So there won't be a way to make this query lightning fast – it has to look at the 30 million rows to figure out that there are only three different ones.

    But the lion's share of the execution time (132 out of 165 seconds) is spent sorting, so it should be possible to make the query faster.

    Some ideas to try:

    • Increase work_mem as much as you can; a larger work_mem makes the sort faster and may keep it from spilling to disk (the plan shows an external merge sort writing about 3 GB to disk).

    PostgreSQL chooses the explicit sort because it does not know that there are so many identical rows. Otherwise it would choose a much faster hash aggregate. Perhaps we can leverage this:

    • Try SET enable_sort = off; in the query's session to see whether that makes PostgreSQL choose the hash aggregate.

    • Upgrade to PostgreSQL v13, which can spill hash aggregates to disk and is therefore much more willing to use them.
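
    The first two ideas can be tried together at the session level; a sketch, where the work_mem value is only an illustrative guess that you should tune to the RAM you can spare:

    ```sql
    -- Session-only settings: they revert when the connection closes
    -- and affect no other sessions.
    SET work_mem = '1GB';     -- illustrative value; the sort currently
                              -- spills roughly 3 GB to disk
    SET enable_sort = off;    -- nudge the planner toward a hash aggregate

    -- ... run the query from the question here, ideally under
    -- EXPLAIN (ANALYZE, BUFFERS) to confirm the plan actually changed ...

    RESET work_mem;           -- restore the defaults afterwards
    RESET enable_sort;
    ```

    Note that enable_sort = off does not forbid sorting; it only makes the planner treat a sort as very expensive, so it will still sort if no other plan exists.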