Search code examples
sqlpostgresqlwhere-clauseleft-join

LEFT JOIN … WHERE returning less rows than LEFT JOIN ON + AND


It seems that I totally fail to understand these query results (a PostgreSQL DB):

a) Rows filtered only after joining them:

SELECT COUNT(*)
FROM apidata.info x
LEFT JOIN compdata.info a
    ON x.info = a.identifier
WHERE
    a.permit_id IS NULL
    AND a.identifier IS NULL;
count
--------
109242
(1 row)

b) Rows filtered only before joining them:

SELECT COUNT(*)
FROM apidata.info x
LEFT JOIN compdata.info a
    ON x.info = a.identifier
    AND a.permit_id IS NULL
WHERE a.identifier IS NULL;
count
--------
111981
(1 row)

c) Rows filtered both before and after joining them:

SELECT COUNT(*)
FROM apidata.info x
LEFT JOIN compdata.info a
    ON x.info = a.identifier
    AND a.permit_id IS NULL
WHERE
    a.permit_id IS NULL
    AND a.identifier IS NULL;
count
--------
111981
(1 row)

I don't understand how the queries b and c can possibly return a higher count than query a. Especially I would expect the query c to return no higher count than the query a, yet it returns exactly the same count as the query b.

This is with some real data with millions of rows in the source tables (I've generalised the table and field names), but it seems that I cannot imagine any simpler sets that would yield similar results.

What am I missing here? Is anyone able to provide a simpler example?


Solution

  • This is a weird condition: x.info = a.identifier and then a.identifier IS NULL.

    The = operator won't work on nulls. So every row of a where identifier is null will NOT be joined. The where condition, where a.identifier IS NULL is therefore useless, since it will always be null. Having only data for the leading table, x.

    And there lies the problem.

    Query a is filtered more agressively, because

    • First you join all compdata rows where identifier is not null (actually matching something in apidata)
    • Then remove all rows from the result that had such a match, by checking a.identifier IS NULL.

    For b and c, you filter less agressively:

    • First you join only the rows that have a matching identifier and a matching permit_id, so this actually matches fewer rows than the join of query a. For the other rows (so more than in the first query), the columns of compdata will be null, and those are the rows you get back from this query.
    • Then you filter out all rows that had a matching join.