Tags: sql, postgresql, greatest-n-per-group, query-planner

Query plan difference between inner join and right join in a "greatest-n-per-group", self-joined, aggregated query


For a small Postgres 10 data warehouse, I was checking our analytics queries for improvements and found a rather slow query where the potential improvement basically boiled down to this subquery (the classic greatest-n-per-group problem):

SELECT s_postings.*
FROM dwh.s_postings
JOIN (SELECT s_postings.id,
          max(s_postings.load_dts) AS load_dts
      FROM dwh.s_postings
      GROUP BY s_postings.id) AS current_postings
ON s_postings.id = current_postings.id AND s_postings.load_dts = current_postings.load_dts

With the following execution plan:

"Gather  (cost=23808.51..38602.59 rows=66 width=376) (actual time=1385.927..1810.844 rows=170847 loops=1)"
"  Workers Planned: 2"
"  Workers Launched: 2"
"  ->  Hash Join  (cost=22808.51..37595.99 rows=28 width=376) (actual time=1199.647..1490.652 rows=56949 loops=3)"
"        Hash Cond: (((s_postings.id)::text = (s_postings_1.id)::text) AND (s_postings.load_dts = (max(s_postings_1.load_dts))))"
"        ->  Parallel Seq Scan on s_postings  (cost=0.00..14113.25 rows=128425 width=376) (actual time=0.016..73.604 rows=102723 loops=3)"
"        ->  Hash  (cost=20513.00..20513.00 rows=153034 width=75) (actual time=1195.616..1195.616 rows=170847 loops=3)"
"              Buckets: 262144  Batches: 1  Memory Usage: 20735kB"
"              ->  HashAggregate  (cost=17452.32..18982.66 rows=153034 width=75) (actual time=836.694..1015.499 rows=170847 loops=3)"
"                    Group Key: s_postings_1.id"
"                    ->  Seq Scan on s_postings s_postings_1  (cost=0.00..15911.21 rows=308221 width=75) (actual time=0.032..251.122 rows=308168 loops=3)"
"Planning time: 1.184 ms"
"Execution time: 1912.865 ms"

The row estimate is absolutely wrong! The weird thing for me is what happens if I change the join to a RIGHT JOIN:

SELECT s_postings.*
FROM dwh.s_postings
RIGHT JOIN (SELECT s_postings.id,
      max(s_postings.load_dts) AS load_dts
   FROM dwh.s_postings
   GROUP BY s_postings.id) AS current_postings
ON s_postings.id = current_postings.id AND s_postings.load_dts = current_postings.load_dts

With the execution plan:

"Hash Right Join  (cost=22829.85..40375.62 rows=153177 width=376) (actual time=814.097..1399.673 rows=170848 loops=1)"
"  Hash Cond: (((s_postings.id)::text = (s_postings_1.id)::text) AND (s_postings.load_dts = (max(s_postings_1.load_dts))))"
"  ->  Seq Scan on s_postings  (cost=0.00..15926.10 rows=308510 width=376) (actual time=0.011..144.584 rows=308419 loops=1)"
"  ->  Hash  (cost=20532.19..20532.19 rows=153177 width=75) (actual time=812.587..812.587 rows=170848 loops=1)"
"        Buckets: 262144  Batches: 1  Memory Usage: 20735kB"
"        ->  HashAggregate  (cost=17468.65..19000.42 rows=153177 width=75) (actual time=553.633..683.850 rows=170848 loops=1)"
"              Group Key: s_postings_1.id"
"              ->  Seq Scan on s_postings s_postings_1  (cost=0.00..15926.10 rows=308510 width=75) (actual time=0.011..157.000 rows=308419 loops=1)"
"Planning time: 0.402 ms"
"Execution time: 1469.808 ms"

The row estimate is way better!

The database is VACUUM ANALYZEd. I am aware that, for example, parallel sequential scans can decrease performance under some conditions, but they should not change the row estimate!? If I remember correctly, aggregate functions also block the proper use of indexes anyway, and I also don't see any potential gains from additional multivariate statistics, e.g. on the tuple (id, load_dts).
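
For reference, by multivariate statistics I mean something along these lines (the statistics object name is just an example); as far as I can tell, in Postgres 10 this would not help the join estimate against the aggregated subquery anyway:

-- Hypothetical extended statistics on (id, load_dts); in Postgres 10 these only
-- feed estimates for conditions on the base table, not for the join against the aggregate.
CREATE STATISTICS s_postings_id_load_dts_stx (ndistinct, dependencies)
    ON id, load_dts
    FROM dwh.s_postings;

ANALYZE dwh.s_postings;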

To me, the two queries are logically the same.

Is there a way to help the query planner make better estimates, or to improve the query itself? Maybe somebody knows a reason why this difference exists?

Edit: Previously the join condition was ON s_postings.id::text = current_postings.id::text. I changed it to ON s_postings.id = current_postings.id so as not to confuse anybody. Removing this cast does not change the query plan.

Edit 2: As suggested below, there is a different solution to the greatest-n-per-group problem:

SELECT p.*
FROM (SELECT p.*,
             RANK() OVER (PARTITION BY p.id ORDER BY p.load_dts DESC) as seqnum
      FROM dwh.s_postings p
     ) p
WHERE seqnum = 1;

A really nice solution, but sadly the query planner also underestimates the row count here:

"Subquery Scan on p  (cost=44151.67..54199.31 rows=1546 width=384) (actual time=1742.902..2594.359 rows=171269 loops=1)"
"  Filter: (p.seqnum = 1)"
"  Rows Removed by Filter: 137803"
"  ->  WindowAgg  (cost=44151.67..50334.83 rows=309158 width=384) (actual time=1742.899..2408.240 rows=309072 loops=1)"
"        ->  Sort  (cost=44151.67..44924.57 rows=309158 width=376) (actual time=1742.887..1927.325 rows=309072 loops=1)"
"              Sort Key: p_1.id, p_1.load_dts DESC"
"              Sort Method: quicksort  Memory: 172275kB"
"              ->  Seq Scan on s_postings p_1  (cost=0.00..15959.58 rows=309158 width=376) (actual time=0.007..221.240 rows=309072 loops=1)"
"Planning time: 0.149 ms"
"Execution time: 2666.645 ms"

Solution

  • The difference in timing is not very large. It could easily just be caching effects. If you alternate between them back-to-back repeatedly, do you still get the difference? If you disable parallel execution by setting max_parallel_workers_per_gather = 0, does that equalize them?
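
    For example, a minimal sketch to take parallelism out of the picture for that comparison (session-level settings only):

        -- Disable parallel plans for this session, then run both join variants with
        -- EXPLAIN (ANALYZE) several times back-to-back so caching affects them equally.
        SET max_parallel_workers_per_gather = 0;

        EXPLAIN (ANALYZE, BUFFERS)
        SELECT s_postings.*
        FROM dwh.s_postings
        JOIN (SELECT id, max(load_dts) AS load_dts
              FROM dwh.s_postings
              GROUP BY id) AS current_postings
          ON s_postings.id = current_postings.id
         AND s_postings.load_dts = current_postings.load_dts;

        -- ...repeat with the RIGHT JOIN variant...

        RESET max_parallel_workers_per_gather;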

    The row estimate is absolutely wrong!

    While this is obviously true, I don't think the misestimation is causing anything particularly bad to happen.

    I am aware that, for example, parallel sequential scans can decrease performance under some conditions, but they should not change the row estimate!?

    Correct. It is the change in the JOIN type that causes the estimation change, and that in turn causes the change in parallelization. The planner thinking it has to push more tuples up to the leader (rather than disqualifying them down in the workers) discourages parallel plans, due to parallel_tuple_cost.
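
    If you want to check that parallel_tuple_cost is really what tips that decision, you could make tuple transfer to the leader look free for one session and see whether the RIGHT JOIN variant then also gets a parallel plan (an experiment, not a production setting):

        SET parallel_tuple_cost = 0;   -- default is 0.1
        -- re-run the RIGHT JOIN query from above with EXPLAIN (ANALYZE) and check
        -- whether a Gather node now shows up in its plan
        RESET parallel_tuple_cost;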

    If I remember correctly, aggregate functions also block the proper use of indexes

    No, an index on (id, load_dts) or even just (id) should be usable for doing the aggregation, but since you need to read the entire table, it will probably be slower to read the entire index plus the entire table than to just read the entire table into a HashAgg. You can test whether PostgreSQL thinks it is capable of using such an index by setting enable_seqscan = off. If it does the seq scan anyway, then it doesn't think the index is usable. Otherwise, it just thinks using the index is counterproductive.
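
    A minimal sketch of that test, assuming you create an index along these lines (the index name is mine):

        -- Hypothetical index covering the aggregation.
        CREATE INDEX s_postings_id_load_dts_idx ON dwh.s_postings (id, load_dts);

        -- Discourage sequential scans for this session; if the aggregate still
        -- seq-scans under EXPLAIN, the planner considers the index unusable,
        -- otherwise it merely considers it more expensive than the seq scan.
        SET enable_seqscan = off;

        EXPLAIN
        SELECT id, max(load_dts)
        FROM dwh.s_postings
        GROUP BY id;

        RESET enable_seqscan;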

    Is there a way to help the query planner make better estimates, or to improve the query itself? Maybe somebody knows a reason why this difference exists?

    The planner lacks the insight to know that every (id, max(load_dts)) row from the derived table must have come from at least one row in the original table. Instead it applies the two conditions in the ON clause as independent variables, and it doesn't even know what the most common values/histograms for your derived table will be, so it can't predict the degree of overlap. But with the RIGHT JOIN, it knows that every row in the derived table gets returned, whether a match is found in the "other" table or not. If you create a temp table from your derived subquery and ANALYZE it, then use that table in the join, you should get better estimates because the planner at least knows how much the distributions in each column overlap. But those better estimates are not likely to lead to hugely better plans, so I wouldn't bother with that complexity.
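
    If you did want to try it anyway, the experiment would look roughly like this (the temp table name is arbitrary):

        -- Materialize the aggregate once so the planner gets real statistics for it.
        CREATE TEMP TABLE current_postings AS
        SELECT id, max(load_dts) AS load_dts
        FROM dwh.s_postings
        GROUP BY id;

        ANALYZE current_postings;

        SELECT s_postings.*
        FROM dwh.s_postings
        JOIN current_postings
          ON s_postings.id = current_postings.id
         AND s_postings.load_dts = current_postings.load_dts;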

    You can probably get some marginal speed-up by rewriting it into a DISTINCT ON query, but it won't be magically better. Also note that the two are not equivalent: the join will return all rows which are tied for first place within a given id, while DISTINCT ON will return an arbitrary one of them (unless you add columns to the ORDER BY to break the ties).
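
    For completeness, the DISTINCT ON rewrite I have in mind would look something like this; unlike the join, it returns exactly one row per id:

        -- Picks one arbitrary row among those tied on the latest load_dts per id;
        -- add more ORDER BY columns after load_dts DESC to break ties deterministically.
        SELECT DISTINCT ON (id) *
        FROM dwh.s_postings
        ORDER BY id, load_dts DESC;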