Tags: postgresql, aggregate, partitioning, query-performance

How to speed up server-side aggregations over partitioned table


Background

I'm generating Tukey box-and-whisker plots from a normalized, partitioned table in Postgres 10.6 on RDS with 3*10^9 rows.

I started with multiple views in succession, consisting of an aggregation step and a subsequent outlier-detection step. First, in the aggregation step, I calculate the median, the 25th and 75th percentiles, the IQR, the lower whisker (25% - 1.5*IQR), and the upper whisker (75% + 1.5*IQR). Second, in the outlier-detection step, I search the table for values that lie outside the whiskers.

Aggregation

create view aggregation as
select
    a.a_name,
    b.b_name,
    c.c_name,
    percentile_cont(0.25) within group (order by d.value) as pct25,
    percentile_cont(0.5)  within group (order by d.value) as median,
    percentile_cont(0.75) within group (order by d.value) as pct75,
    percentile_cont(0.75) within group (order by d.value)
        - percentile_cont(0.25) within group (order by d.value) as iqr,
    percentile_cont(0.25) within group (order by d.value)
        - 1.5 * (percentile_cont(0.75) within group (order by d.value)
                 - percentile_cont(0.25) within group (order by d.value)) as lower_whisker,
    percentile_cont(0.75) within group (order by d.value)
        + 1.5 * (percentile_cont(0.75) within group (order by d.value)
                 - percentile_cont(0.25) within group (order by d.value)) as upper_whisker
from dtable as d
join atable as a on a.a_id = d.a_id
join etable as e on e.e_id = d.e_id
join ftable as f on f.f_id = e.f_id
join btable as b on b.b_id = f.b_id
join ctable as c on c.c_id = b.c_id
where (d.e_id between 3440500 and 3459500)
and (c.c_name = 'this_c_in_particular')
and (b.b_name in ('first_b', 'second_b', 'third_b'))
group by
    a.a_name,
    b.b_name,
    c.c_name;

Note that the dtable is partitioned by e_id
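
For reference, a range-partitioned dtable might be declared roughly like this (the column types and partition bounds are illustrative, not taken from the actual schema):

    create table dtable (
        a_id  bigint,
        e_id  bigint,
        value double precision
    ) partition by range (e_id);

    -- one partition per e_id range; the bounds here are made up
    create table dtable_p344 partition of dtable
        for values from (3400000) to (3500000);

With a layout like this, the filter d.e_id between 3440500 and 3459500 lets the planner prune partitions whose bounds cannot match.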

Outlier detection

create view outliers as
select d.*
from dtable as d
join atable as a on a.a_id = d.a_id
join etable as e on e.e_id = d.e_id
join ftable as f on f.f_id = e.f_id
join btable as b on b.b_id = f.b_id
join ctable as c on c.c_id = b.c_id
join aggregation as agg
    on agg.a_name = a.a_name
   and agg.b_name = b.b_name
   and agg.c_name = c.c_name
where d.value < agg.lower_whisker or d.value > agg.upper_whisker
;

Results

Currently, using a flat client-side pandas dataframe, I can perform these aggregations in under 10 seconds (after the network transfer and server-side downsampling). However, run server-side through the views above, these aggregations take at least 1 minute.

The EXPLAIN (ANALYZE) plan is available here: https://explain.depesz.com/s/0gAu

Question

  1. Is there a standard method for calculating these aggregations quickly?
  2. Is there a way to get Postgres to calculate these things in parallel, with one worker per group?

Any insight or discussion is more than welcome - thanks for reading.


Solution

  • The execution plan has some things I don't understand:

    • Why is there a Gather node if no parallel workers were planned? From the loops I'd expect two workers.

    • Why does explain.depesz.com not count the 895693 iterations of the bottom node (maybe it is as confused by the above as I am)?

    Nonetheless, some problems can be detected right away:

    • There are terrible mis-estimates (725 instead of 895693 actual rows!).

    • Most of your time is spent in a sort that spills to disk.

    So here is what you can improve without rewriting the query:

    • Increase work_mem until the sort becomes an in-memory quicksort. That should be the biggest gain.

      You don't have to increase it globally, you could run something like:

      BEGIN;
      SET LOCAL work_mem = '1GB';
      SELECT /* your query */;
      COMMIT;
      
    • Some of the tables seem to have stale statistics. Try running ANALYZE on all the tables in question; maybe that does some good.
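
      For example, assuming the table names from the question:

      ANALYZE dtable;
      ANALYZE atable;
      ANALYZE btable;
      ANALYZE ctable;
      ANALYZE etable;
      ANALYZE ftable;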

    • You can probably shave off a few more seconds by avoiding the misguided nested loop joins. Perhaps the ANALYZE will take care of that.

      As a last resort, you could simply disable nested loops for that one query by setting enable_nestloop = off, with the same trick that I showed for work_mem above.
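
      Applied here, the same pattern would be (the query body is your own):

      BEGIN;
      SET LOCAL enable_nestloop = off;
      SELECT /* your query */;
      COMMIT;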

    The scan on the partitioned table is not your problem, so you don't have to worry about parallelizing that (PostgreSQL v11 has become smarter with that).

    If all of the above does not make the query fast enough, you can consider using a materialized view. Then you get slightly stale data, but the query becomes fast.
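
    A minimal sketch (the name aggregation_mat is made up):

      CREATE MATERIALIZED VIEW aggregation_mat AS
      SELECT * FROM aggregation;

      -- re-run whenever fresher numbers are needed;
      -- readers see the snapshot from the last refresh
      REFRESH MATERIALIZED VIEW aggregation_mat;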