sql postgresql datetime inner-join greatest-n-per-group

Postgres JOIN on multiple possible columns with OR statement

I have two tables that I want to join together:

`contracts`:

id	customer_id_1	customer_id_2	customer_id_3	date
1	MAIN1	TRAN1	TRAN2	20201101
2	MAIN2			20201001
3	MAIN3	TRAN5		20200901
4	MAIN4	TRAN7	TRAN8	20200801

`customers`:

id	customer_id	info	date
1	MAIN1	blah	20200930
2	TRAN2	blah	20200929
3	TRAN5	blah	20200831
4	TRAN7	blah	20200801

In my contracts table, each row represents a contract with a customer, who may have 1 or more different IDs they are referred to by in the customers table. In the customers table, I have info on customers (can be zero or multiple records on different dates for each customer). I want to perform a join from contracts onto customers such that I get the most recent info available on a customer at the time a contract is recorded, ignoring any potential customer info that may be available after the contract date. I am also not interested in contracts which have no info on the customers. The main problem here is that in customers, each customer record can reference any 1 of the 3 IDs that may exist.

I currently have the following query which performs the task as intended but the problem is that is extremely slow when run on data in the 50-100k rows range. If I remove the OR statements in the INNER JOIN and just join on the the first ID, the query performs in seconds as opposed to ~ half an hour.

SELECT 
  DISTINCT ON (ctr.id) 
  ctr.id, 
  ctr.customer_id_1, 
  ctr.date AS contract_date, 
  cst.info, 
  cst.date AS info_date
FROM 
  contracts ctr
  INNER JOIN customers cst ON (
    cst.customer_id = ctr.customer_id_1
    OR cst.customer_id = ctr.customer_id_2
    OR cst.customer_id = ctr.customer_id_3
  )
  AND ctr.date >= cst.date
ORDER BY
  ctr.id,
  cst.date DESC

Result:

id	customer_id_1	contract_date	info	info_date
1	MAIN1	20201101	blah	20200930
3	MAIN3	20200901	blah	20200831
4	MAIN4	20200801	blah	20200801

It seems like OR statements in JOINs aren't very common (I've barely found any examples online) and I presume this is because there must be a better way of doing this. So my question is, how can this be optimised?

Solution

OR often is a performance killer in SQL predicates.

One alternative unpivots before joining:

select distinct on (ctr.id) 
    ctr.id, 
    ctr.customer_id_1, 
    ctr.date as contract_date, 
    cst.info, 
    cst.date as info_date
from contracts ctr
cross join lateral (values 
    (ctr.customer_id_1), (ctr.customer_id_2), (ctr.customer_id_3)
) as ctx(customer_id)
inner join customers cst on cst.customer_id = ctx.customer_id and ctr.date >= cst.date
order by  ctr.id, cst.date desc

The use of this techniques pinpoints that your could vastly improve your data model: the relation between contracts and customers should be stored in a separate table, with each customer/contract tuple on a separate row - essentially, what the query does is virtually build that derived table in the lateral join.

Postgres JOIN on multiple possible columns with OR statement

contracts:

customers:

Result:

`contracts`:

`customers`: