Tags: postgresql, query-optimization

Slow postgresql query with distinct on subquery


I've been working on improving the query times for a few common query patterns in Postgres. The table structure can be simplified to the following:

create table file_group (
  id serial primary key,
  tag varchar(64),
  creation_time timestamp
);
create table file (
  id serial primary key,
  file_group_id integer not null references file_group,
  name text,
  ordering int
);

For a given file_group, there can be 0-N files referencing it. For all queries, I need to determine the most recent file_group for each unique tag value and then retrieve the files associated with it. All of the common queries follow a structure similar to this:

select * from
(
    select distinct on (file_group.tag) * 
    from file_group
    where file_group.tag < 'C'  -- dynamic user supplied filter
    order by file_group.tag asc, file_group.creation_time desc
) fg
left join file on fg.id = file.file_group_id  -- left join in case of 0 files in group
order by -- returned ordering of files is important
    fg.tag asc, 
    fg.creation_time desc,
    file.ordering asc;

I've tried a few different indexing approaches to improve the speed of this query pattern. Currently there is an index on file_group (tag asc, creation_time desc), which works well with the dynamic user-supplied filters applied in the where clause.
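
The DDL for that index, for reference (the index name here is just illustrative):

create index file_group_tag_time
    on file_group (tag asc, creation_time desc);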

However, I'm currently stuck on the distinct on subquery. When there are a large number of matching rows, the planner uses relatively slow nested loops to join the subquery to the file table. I've been struggling to come up with a subquery-free solution using a window function or similar.
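
For example, a row_number() formulation of the same "latest file_group per tag" logic still ends up needing a subquery, since a window function can't be filtered in the same query level where it is computed:

select *
from (
    select file_group.*,
           row_number() over (partition by file_group.tag
                              order by file_group.creation_time desc) as rn
    from file_group
    where file_group.tag < 'C'  -- dynamic user supplied filter
) fg
left join file on fg.id = file.file_group_id
where fg.rn = 1
order by fg.tag asc, fg.creation_time desc, file.ordering asc;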

The desired output can be seen in this fiddle here.

Some other details

  • In reality the file_group table has multiple tag columns; previously these were stored in a single HSTORE column.
  • The file_group table has ~50M records and the file table has ~60M. Most file groups have a single file in them, but some have 0 and some have many.
  • Inserts happen frequently and returned data is expected to be up to date, so I don't believe a materialized view is a good approach.

Full Details

Sample Query

select file.name, file.creation_time, file.metadata, fg.partition_1, fg.partition_2, fg.partition_3, fg.partition_4
from (
    select
        distinct on (file_group.partition_1, file_group.partition_2, file_group.partition_3, file_group.partition_4)
        file_group.id, file_group.partition_1, file_group.partition_2, file_group.partition_3, file_group.partition_4
    from file_group
    where (
        file_group.dataset_id = 123
        and file_group.partition_2 > '2024-03-15'
        and file_group.partition_2 < '2024-04-15'
    )
    order by
        file_group.partition_1,
        file_group.partition_2,
        file_group.partition_3,
        file_group.partition_4,
        file_group.creation_time desc
) as fg
left join file on fg.id = file.file_group_id
order by
    fg.partition_1,
    fg.partition_2,
    fg.partition_3,
    fg.partition_4,
    file.ordering,
    file.creation_time;

Explain analyze output from a beefy server:

Incremental Sort  (cost=234.12..4233792.05 rows=24340445 width=449) (actual time=221.228..875.170 rows=169594 loops=1)
  Sort Key: file_group.partition_1, file_group.partition_2, file_group.partition_3, file_group.partition_4, file.ordering, file.creation_time
  Presorted Key: file_group.partition_1, file_group.partition_2, file_group.partition_3, file_group.partition_4
  Full-sort Groups: 5300  Sort Method: quicksort  Average Memory: 41kB  Peak Memory: 41kB
  ->  Nested Loop Left Join  (cost=1.13..1874018.77 rows=24340445 width=449) (actual time=221.119..799.179 rows=169594 loops=1)
        ->  Result  (cost=0.56..285118.45 rows=167341 width=187) (actual time=221.088..542.486 rows=169594 loops=1)
              ->  Unique  (cost=0.56..285118.45 rows=167341 width=187) (actual time=110.729..418.591 rows=169594 loops=1)
                    ->  Index Scan using file_group_full_nulls_last on file_group  (cost=0.56..283416.93 rows=170152 width=187) (actual time=110.727..389.307 rows=225379 loops=1)
                          Index Cond: ((dataset_id = 123) AND ((partition_2)::text > '2024-03-15'::text) AND ((partition_2)::text < '2024-04-15'::text))
        ->  Index Scan using file_file_group_id on file  (cost=0.56..8.03 rows=145 width=278) (actual time=0.001..0.001 rows=1 loops=169594)
              Index Cond: (file_group_id = file_group.id)
Planning Time: 0.221 ms
JIT:
  Functions: 10
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 0.773 ms, Inlining 7.425 ms, Optimization 62.685 ms, Emission 40.250 ms, Total 111.133 ms
Execution Time: 887.591 ms

Index

create index file_group_full_nulls_last
    on file_group (dataset_id asc, partition_1 asc, partition_2 asc, partition_3 asc, partition_4 asc,
                        creation_time desc);

Query plan with nested loops disabled:

Sort  (cost=9763486.28..9817497.33 rows=21604417 width=344) (actual time=9484.202..9533.795 rows=169594 loops=1)
  Sort Key: fg.partition_1, fg.partition_2, fg.partition_3, fg.partition_4, file_flat.ordering, file_flat.creation_time
  Sort Method: external merge  Disk: 64672kB
  ->  Merge Right Join  (cost=295577.24..3152452.21 rows=21604417 width=344) (actual time=4564.037..8972.023 rows=169594 loops=1)
        Merge Cond: (file_flat.file_group_id = fg.id)
        ->  Index Scan using file_flat_file_group_id on file_flat  (cost=0.56..2405926.39 rows=50601548 width=173) (actual time=0.015..5724.595 rows=50061736 loops=1)
        ->  Materialize  (cost=295576.68..296334.70 rows=151605 width=179) (actual time=469.033..491.898 rows=169594 loops=1)
              ->  Sort  (cost=295576.68..295955.69 rows=151605 width=179) (actual time=469.028..476.677 rows=169594 loops=1)
                    Sort Key: fg.id
                    Sort Method: quicksort  Memory: 19753kB
                    ->  Subquery Scan on fg  (cost=0.56..274638.60 rows=151605 width=179) (actual time=109.232..424.851 rows=169594 loops=1)
                          ->  Result  (cost=0.56..273122.55 rows=151605 width=187) (actual time=109.229..413.545 rows=169594 loops=1)
                                ->  Unique  (cost=0.56..273122.55 rows=151605 width=187) (actual time=109.214..400.298 rows=169594 loops=1)
                                      ->  Index Scan using file_group_flat_full_nulls_last on file_group_flat  (cost=0.56..271583.47 rows=153908 width=187) (actual time=109.212..374.579 rows=225379 loops=1)
                                            Index Cond: ((dataset_id = 123) AND ((partition_2)::text > '2024-03-15'::text) AND ((partition_2)::text < '2024-04-15'::text))
Planning Time: 0.385 ms
JIT:
  Functions: 13
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.006 ms, Inlining 5.238 ms, Optimization 87.774 ms, Emission 56.064 ms, Total 150.082 ms
Execution Time: 9561.524 ms

Solution

  • Only about 10% of the time is spent in JIT, but even that is unlikely to be worthwhile. As your filter gets more restrictive, the JIT time will not change much, so it will take up an increasing share of the (lower) total run time. I would just turn JIT off system-wide; in my experience it harms far more than it helps and should never have been turned on by default in the first place.
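
    You can try it per session first before making it permanent with jit = off in postgresql.conf:

        set jit = off;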

    At some point the planner should decide on its own that JIT is not worthwhile, based on jit_above_cost, but your cost estimate is so inflated (due to the poor row estimate on the nested loop, which in turn is due to the poor row estimate on the Index Scan using file_file_group_id) that this is unlikely to be reliable. It would be nice if the estimate were better, but if JIT were simply turned off it probably wouldn't make much further difference.

    In your nested loop, the first (single, bulk) index scan is the slow step; it is slower than all the iterations of the 2nd index scan combined. Making that first scan more selective would also reduce the time spent in the 2nd one, since it reduces the number of iterations, but as that is only a minority of the total time it wouldn't be all that important.

    Whether making it more selective will actually make the first index scan faster depends on the details. If you filter on a smaller dataset, that should certainly make it faster. But if you make the range on partition_2 narrower, it might not. Because partition_2 is not the 2nd column in the index, the range condition is essentially applied as an in-index filter, so the work done inside the index is proportional to the number of rows meeting the dataset_id condition, not to the final number of rows meeting both conditions. If you reversed the order of partition_1 and partition_2 in the index, the range filter could be applied much more efficiently. But the index would then no longer be useful for supplying rows ordered by partition_1. Which of these two matters more is best determined by trying it.
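
    A sketch of the reversed index, for reference (the index name is illustrative):

        create index file_group_p2_first
            on file_group (dataset_id asc, partition_2 asc, partition_1 asc,
                           partition_3 asc, partition_4 asc, creation_time desc);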

    When you disabled the nested loop, I'm surprised it switched to a merge join rather than a hash join (but I'm not surprised it got slower). That choice of join may be due to an older version of PostgreSQL or to a low setting of work_mem, but even if you did get a hash join, it is still unlikely to be faster: reading 60 million rows and hashing them into memory is unlikely to be the fastest approach when only about 150,000 of them are needed.
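
    If you do want to see how a hash join performs here, you can disable the other join strategies for a single transaction before re-running explain analyze:

        begin;
        set local enable_nestloop = off;
        set local enable_mergejoin = off;
        -- explain (analyze, buffers) <the sample query from above>;
        rollback;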