I have two tables that I want to join together:
contracts
:id | customer_id_1 | customer_id_2 | customer_id_3 | date |
---|---|---|---|---|
1 | MAIN1 | TRAN1 | TRAN2 | 20201101 |
2 | MAIN2 | 20201001 | ||
3 | MAIN3 | TRAN5 | 20200901 | |
4 | MAIN4 | TRAN7 | TRAN8 | 20200801 |
customers
:id | customer_id | info | date |
---|---|---|---|
1 | MAIN1 | blah | 20200930 |
2 | TRAN2 | blah | 20200929 |
3 | TRAN5 | blah | 20200831 |
4 | TRAN7 | blah | 20200801 |
In my contracts
table, each row represents a contract with a customer, who may have 1 or more different IDs they are referred to by in the customers
table. In the customers
table, I have info on customers (can be zero or multiple records on different dates for each customer). I want to perform a join from contracts
onto customers
such that I get the most recent info available on a customer at the time a contract is recorded, ignoring any potential customer info that may be available after the contract date. I am also not interested in contracts which have no info on the customers. The main problem here is that in customers
, each customer record can reference any 1 of the 3 IDs that may exist.
I currently have the following query which performs the task as intended but the problem is that is extremely slow when run on data in the 50-100k rows range. If I remove the OR
statements in the INNER JOIN
and just join on the the first ID, the query performs in seconds as opposed to ~ half an hour.
SELECT
DISTINCT ON (ctr.id)
ctr.id,
ctr.customer_id_1,
ctr.date AS contract_date,
cst.info,
cst.date AS info_date
FROM
contracts ctr
INNER JOIN customers cst ON (
cst.customer_id = ctr.customer_id_1
OR cst.customer_id = ctr.customer_id_2
OR cst.customer_id = ctr.customer_id_3
)
AND ctr.date >= cst.date
ORDER BY
ctr.id,
cst.date DESC
id | customer_id_1 | contract_date | info | info_date |
---|---|---|---|---|
1 | MAIN1 | 20201101 | blah | 20200930 |
3 | MAIN3 | 20200901 | blah | 20200831 |
4 | MAIN4 | 20200801 | blah | 20200801 |
It seems like OR
statements in JOIN
s aren't very common (I've barely found any examples online) and I presume this is because there must be a better way of doing this. So my question is, how can this be optimised?
OR
often is a performance killer in SQL predicates.
One alternative unpivots before joining:
select distinct on (ctr.id)
ctr.id,
ctr.customer_id_1,
ctr.date as contract_date,
cst.info,
cst.date as info_date
from contracts ctr
cross join lateral (values
(ctr.customer_id_1), (ctr.customer_id_2), (ctr.customer_id_3)
) as ctx(customer_id)
inner join customers cst on cst.customer_id = ctx.customer_id and ctr.date >= cst.date
order by ctr.id, cst.date desc
The use of this techniques pinpoints that your could vastly improve your data model: the relation between contracts and customers should be stored in a separate table, with each customer/contract tuple on a separate row - essentially, what the query does is virtually build that derived table in the lateral join.