Tags: postgresql, optimization, query-optimization

Improving performance of a GROUP BY ... HAVING COUNT(...) > 1 in PostgreSQL


I'm trying to select the orders that are part of a trip with multiple orders.

I've tried many approaches but can't find a way to make this query performant.

To reproduce the problem, here is the setup (it uses 100 000 rows here, but in reality it's closer to 1 000 000 rows, which is enough to hit the timeout on db-fiddle).

Schema (PostgreSQL v14)

create table trips (id bigint primary key);
create table orders (id bigint primary key, trip_id bigint);
-- note: the primary keys above already create unique indexes,
-- so trips_idx and orders_idx are redundant
create index trips_idx on trips (id);
create index orders_idx on orders (id);
create index orders_trip_idx on orders (trip_id);

insert into trips (id) select seq from generate_series(1,100000) seq;
insert into orders (id, trip_id) select seq, floor(random() * 100000 + 1) from generate_series(1,100000) seq;
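After a bulk load like this, it can help to refresh the planner's statistics before benchmarking, so that `EXPLAIN ANALYZE` reflects realistic plans. This step is not in the original fiddle, but it is cheap and standard practice:

```sql
-- refresh planner statistics after the bulk insert
analyze trips;
analyze orders;
```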

Query #1

explain analyze select orders.id
from orders
inner join trips on trips.id = orders.trip_id
inner join orders trips_orders on trips_orders.trip_id = trips.id
group by orders.id, trips.id
having count(trips_orders) > 1
limit 50
;

View on DB Fiddle

Here is what pgmustard gives me on the real query:

[pgmustard screenshot of the query plan]


Solution

  • Do you actually need the join on trips? You could try

    SELECT shared.id
    FROM orders shared
    WHERE EXISTS (SELECT * FROM orders other 
                  WHERE other.trip_id = shared.trip_id
                  AND other.id != shared.id
                  )
    ;
    

    to replace the GROUP BY with a semi-join (typically executed as a hash semi-join), or

    SELECT unnest(array_agg(orders.id))
    FROM orders
    GROUP BY trip_id
    HAVING count(*) > 1
    ;
    

    to hopefully get Postgres to just use the trip_id index.
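A third variant sometimes worth benchmarking (my own sketch, not part of the original answer) is a window function: it counts the orders per trip in a single pass over `orders` and then filters on that count, which avoids both the self-join and the array aggregation:

```sql
-- count orders per trip with a window function, then keep
-- only orders whose trip has more than one order
SELECT id
FROM (
  SELECT id,
         count(*) OVER (PARTITION BY trip_id) AS trip_order_count
  FROM orders
) counted
WHERE trip_order_count > 1
LIMIT 50;
```

Whether this beats the `EXISTS` form depends on the data distribution; it is worth comparing the three plans with `EXPLAIN ANALYZE` on your real table.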