I have a two tables like this: (My actual tables are different. I'm using these to simplify the problem.)
purchases (
item_id,
order_id,
PRIMARY KEY(item_id, order_id)
)
payments (
item_id,
order_id,
payment_id
)
When I issue a query like this I get row estimates that are far too low because the query planner assumes item_id and order_id are independent:
SELECT *
FROM payments
JOIN purchases USING (item_id, order_id)
I can use a query like this to get around the issue:
SELECT *
FROM payments a
JOIN purchases b ON a.item_id || a.order_id = b.item_id || b.order_id
However, that causes other inefficiencies, and the query planner has less flexibility to reason about the query.
Another approach would be to introduce a generated column that concatenates the ids, but it would increase the table's space requirements if it is indexed and it would be equivalent to the prior approach if it were a virtual column.
Is there a way to tell the query planner that how to estimate cardinality on a group of columns without these tradeoffs?
I would try extended statistics on both tables:
CREATE STATISTICS purchases_ext (dependencies)
ON item_id, order_id FROM purchases;
ANALYZE purchases;
CREATE STATISTICS payments_ext (dependencies)
ON item_id, order_id FROM payments;
ANALYZE payments;
Then the optimizer can add a correction factor to the estimate.