Search code examples
sqljoin

Translating SEMI JOIN, ANTI JOIN into non-join SQL


Let's start with the following two tables:

CREATE TABLE people AS (
    SELECT * FROM VALUES
    ('david', 10), 
    ('george', 20)
    AS tmp(name, age)
);

CREATE TABLE position AS (
    SELECT * FROM VALUES
    ('george', 'c++'), 
    ('george', 'frontend')
    AS tmp(name, job)
);

Is the following two equivalent ways to write the SEMI JOIN or ANTI JOIN pattern? If not, what is missing?

[SEMI-JOIN == WHERE EXISTS]
SELECT * FROM people SEMI JOIN position ON (people.name=position.name)
SELECT * FROM people WHERE EXISTS (SELECT * FROM position WHERE people.name=position.name)
// or like this?
SELECT * FROM people WHERE name IN (SELECT name FROM position)

[ANTI-JOIN == WHERE NOT EXISTS]
SELECT * FROM people ANTI JOIN position ON (people.name=position.name)
SELECT * FROM people WHERE NOT EXISTS (SELECT * FROM position WHERE people.name=position.name)
// or like this?
SELECT * FROM people WHERE name NOT IN (SELECT name FROM position)


Solution

  • In each example (semi and anti joins), the following pair of options are not the same because of the SELECT * at the start - which will get all columns from all the joined tables.

    If you used SELECT people.* as the start, they are functionally the same as far as I can see.

    In SQL Server at least, the bottom two options of each have identical plans in an example I tried (SQL Server referred to them as Left semi join and Left anti semi join respectively).

    Edit: ANTI JOIN Note that a more analagous way may be to use LEFT OUTER JOIN and then WHERE the joined table doesn't have any rows e.g.,

    SELECT * 
    FROM people LEFT OUTER JOIN position ON (people.name=position.name)
    WHERE position.name IS NULL
    

    All will give the same results, but the EXISTS versions tend to be closer to the theoretical approach.