Search code examples
postgresqlquery-planner

Is there a way to explicitly tell the Postgres query planner that columns are dependent?


I have a two tables like this: (My actual tables are different. I'm using these to simplify the problem.)

purchases (
  item_id,
  order_id,
  PRIMARY KEY(item_id, order_id)
)
payments (
  item_id,
  order_id,
  payment_id
)

When I issue a query like this I get row estimates that are far too low because the query planner assumes item_id and order_id are independent:

SELECT *
FROM payments
JOIN purchases USING (item_id, order_id)

I can use a query like this to get around the issue:

SELECT *
FROM payments a
JOIN purchases b ON a.item_id || a.order_id = b.item_id || b.order_id

However, that causes other inefficiencies, and the query planner has less flexibility to reason about the query.

Another approach would be to introduce a generated column that concatenates the ids, but it would increase the table's space requirements if it is indexed and it would be equivalent to the prior approach if it were a virtual column.

Is there a way to tell the query planner that how to estimate cardinality on a group of columns without these tradeoffs?


Solution

  • I would try extended statistics on both tables:

    CREATE STATISTICS purchases_ext (dependencies)
       ON item_id, order_id FROM purchases;
    
    ANALYZE purchases;
    
    CREATE STATISTICS payments_ext (dependencies)
       ON item_id, order_id FROM payments;
    
    ANALYZE payments;
    

    Then the optimizer can add a correction factor to the estimate.