Search code examples
sqlpostgresqldatetimeinner-joingreatest-n-per-group

Postgres JOIN on multiple possible columns with OR statement


I have two tables that I want to join together:

contracts:

id customer_id_1 customer_id_2 customer_id_3 date
1 MAIN1 TRAN1 TRAN2 20201101
2 MAIN2 20201001
3 MAIN3 TRAN5 20200901
4 MAIN4 TRAN7 TRAN8 20200801

customers:

id customer_id info date
1 MAIN1 blah 20200930
2 TRAN2 blah 20200929
3 TRAN5 blah 20200831
4 TRAN7 blah 20200801

In my contracts table, each row represents a contract with a customer, who may have 1 or more different IDs they are referred to by in the customers table. In the customers table, I have info on customers (can be zero or multiple records on different dates for each customer). I want to perform a join from contracts onto customers such that I get the most recent info available on a customer at the time a contract is recorded, ignoring any potential customer info that may be available after the contract date. I am also not interested in contracts which have no info on the customers. The main problem here is that in customers, each customer record can reference any 1 of the 3 IDs that may exist.

I currently have the following query which performs the task as intended but the problem is that is extremely slow when run on data in the 50-100k rows range. If I remove the OR statements in the INNER JOIN and just join on the the first ID, the query performs in seconds as opposed to ~ half an hour.

SELECT 
  DISTINCT ON (ctr.id) 
  ctr.id, 
  ctr.customer_id_1, 
  ctr.date AS contract_date, 
  cst.info, 
  cst.date AS info_date
FROM 
  contracts ctr
  INNER JOIN customers cst ON (
    cst.customer_id = ctr.customer_id_1
    OR cst.customer_id = ctr.customer_id_2
    OR cst.customer_id = ctr.customer_id_3
  )
  AND ctr.date >= cst.date
ORDER BY
  ctr.id,
  cst.date DESC

Result:

id customer_id_1 contract_date info info_date
1 MAIN1 20201101 blah 20200930
3 MAIN3 20200901 blah 20200831
4 MAIN4 20200801 blah 20200801

It seems like OR statements in JOINs aren't very common (I've barely found any examples online) and I presume this is because there must be a better way of doing this. So my question is, how can this be optimised?


Solution

  • OR often is a performance killer in SQL predicates.

    One alternative unpivots before joining:

    select distinct on (ctr.id) 
        ctr.id, 
        ctr.customer_id_1, 
        ctr.date as contract_date, 
        cst.info, 
        cst.date as info_date
    from contracts ctr
    cross join lateral (values 
        (ctr.customer_id_1), (ctr.customer_id_2), (ctr.customer_id_3)
    ) as ctx(customer_id)
    inner join customers cst on cst.customer_id = ctx.customer_id and ctr.date >= cst.date
    order by  ctr.id, cst.date desc
    

    The use of this techniques pinpoints that your could vastly improve your data model: the relation between contracts and customers should be stored in a separate table, with each customer/contract tuple on a separate row - essentially, what the query does is virtually build that derived table in the lateral join.