Tags: postgresql, optimization, query-optimization

Improving performance of a GROUP BY ... HAVING COUNT(...) > 1 in PostgreSQL


I'm trying to select the orders that are part of a trip with multiple orders.

I've tried many approaches but can't find a way to make this query performant.

To reproduce the problem, here is the setup (it uses 100 000 rows here, but in reality it's closer to 1 000 000 rows, which is enough to hit the timeout on db-fiddle).

Schema (PostgreSQL v14)

create table trips (id bigint primary key);
create table orders (id bigint primary key, trip_id bigint);
-- note: the primary keys above already create unique indexes,
-- so trips_idx and orders_idx are redundant
create index trips_idx on trips (id);
create index orders_idx on orders (id);
create index orders_trip_idx on orders (trip_id);

insert into trips (id) select seq from generate_series(1,100000) seq;
insert into orders (id, trip_id) select seq, floor(random() * 100000 + 1) from generate_series(1,100000) seq;
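After a bulk load like this, it can help to refresh the planner's statistics before benchmarking, so that `EXPLAIN ANALYZE` reflects realistic plans. This step is not in the original fiddle, but it is cheap and standard practice:

```sql
-- refresh planner statistics after the bulk insert
analyze trips;
analyze orders;
```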

Query #1

explain analyze select orders.id
from orders
inner join trips on trips.id = orders.trip_id
inner join orders trips_orders on trips_orders.trip_id = trips.id
group by orders.id, trips.id
having count(trips_orders) > 1
limit 50
;

View on DB Fiddle

Here is what pgmustard gives me on the real query:

[pgmustard screenshot of the query plan]


Solution

  • Do you actually need the join on trips? You could try

    SELECT shared.id
    FROM orders shared
    WHERE EXISTS (SELECT * FROM orders other 
                  WHERE other.trip_id = shared.trip_id
                  AND other.id != shared.id
                  )
    ;
    

    to replace the GROUP BY with a semi-join (typically executed as a hash semi-join), or

    SELECT unnest(array_agg(orders.id))
    FROM orders
    GROUP BY trip_id
    HAVING count(*) > 1
    ;
    

    to hopefully get Postgres to just use the trip_id index.
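A third variant sometimes worth benchmarking (my own sketch, not part of the original answer) is a window function: it counts the orders per trip in a single pass over `orders` and then filters on that count, which avoids both the self-join and the array aggregation:

```sql
-- count orders per trip with a window function, then keep
-- only orders whose trip has more than one order
SELECT id
FROM (
  SELECT id,
         count(*) OVER (PARTITION BY trip_id) AS trip_order_count
  FROM orders
) counted
WHERE trip_order_count > 1
LIMIT 50;
```

Whether this beats the `EXISTS` form depends on the data distribution; it is worth comparing the three plans with `EXPLAIN ANALYZE` on your real table.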