Search code examples
sqldatabaseperformanceprocessing-efficiency

SELECT ... WHERE IN vs JOIN


Assume we have 4 tables A, B, C, D in some unspecified relational SQL database. A references B, also C and D as well. References means that A has column A.refX_id = X.id, X is A, B and C (common foreign key 1:N).

What I want, is to query table A with condition based on columns from all child tables B, C and D. My question is: What of the following variants is generally better? (In terms of usability, efficiency, speed.)

Variant 1:

SELECT DISTINCT A.* FROM A
   JOIN B ON A.refB_id = B.id
   JOIN C ON A.refC_id = C.id
   JOIN D ON A.refD_id = D.id
   WHERE <condition on B> AND <condition on C> AND <condition on D>;

Which I like more from perspective of a database, but looks a little bit harder to program.

Variant 2:

SELECT id FROM B WHERE <condition on B>; # result store to array "BIds" on program side
SELECT id FROM C WHERE <condition on C>; # result store to array "CIds" on program side
SELECT id FROM D WHERE <condition on D>; # result store to array "DIds" on program side

SELECT A.* FROM A
   WHERE refB_id IN (<B_ids>) AND refC_id IN (<C_ids>) AND refD_id IN (<D_ids>);

# <B_ids> menas expand whole array of ids, which can result in a very long query string

I thought that Variant 2 is completelly smut and unusable with potentially large data. But I've heard, that a lot of frameworks normally use it, because it is relativelly straightforward. Is it sort of legal way to query data like this in general case, if I know that the content of the "IN" clause is taken from a result of another query(ies)?


Solution

  • I would encourage you to use IN or EXISTS:

    SELECT A.*
    FROM A
    WHERE EXISTS (SELECT 1 FROM B WHERE A.refB_id = B.id AND <condition on B>) AND
          EXISTS (SELECT 1 FROM C WHERE A.refC_id = C.id AND <condition on C>) AND
          EXISTS (SELECT 1 FROM D WHERE A.refD_id = D.id AND <condition on D>);
    

    Advantages of this approach:

    • No danger of getting large intermediate Cartesian products.
    • No expense of eliminating duplicates for SELECT DISTINCT.
    • Most databases handle EXISTS pretty well.
    • You can optimize each subquery using indexes.

    EDIT:

    You can write this with IN and subqueries:

    SELECT A.*
    FROM A
    WHERE A.refB_id IN (SELECT B.id FROM B WHERE <condition on B>) AND
          A.refC_id IN (SELECT C.id FROM C WHERE <condition on C>) AND
          A.refD_id IN (SELECT D.id FROM D WHERE <condition on D>);