It seems that I totally fail to understand these query results (from a PostgreSQL database):
a) Rows filtered only after joining them:
SELECT COUNT(*)
FROM apidata.info x
LEFT JOIN compdata.info a
ON x.info = a.identifier
WHERE
a.permit_id IS NULL
AND a.identifier IS NULL;
count
--------
109242
(1 row)
b) Rows filtered only before joining them:
SELECT COUNT(*)
FROM apidata.info x
LEFT JOIN compdata.info a
ON x.info = a.identifier
AND a.permit_id IS NULL
WHERE a.identifier IS NULL;
count
--------
111981
(1 row)
c) Rows filtered both before and after joining them:
SELECT COUNT(*)
FROM apidata.info x
LEFT JOIN compdata.info a
ON x.info = a.identifier
AND a.permit_id IS NULL
WHERE
a.permit_id IS NULL
AND a.identifier IS NULL;
count
--------
111981
(1 row)
I don't understand how queries b and c can possibly return a higher count than query a. In particular, I would expect query c to return a count no higher than query a's, yet it returns exactly the same count as query b.
This is real data with millions of rows in the source tables (I have generalised the table and field names), but I cannot come up with any simpler data set that would yield similar results.
What am I missing here? Is anyone able to provide a simpler example?
The weird part is this combination of conditions: the join condition
x.info = a.identifier
together with the filter a.identifier IS NULL.
The = operator won't match NULLs, so every row of a where identifier is NULL will NOT be joined. After the LEFT JOIN, a.identifier can therefore only be NULL when the join found no match at all: the condition WHERE a.identifier IS NULL selects exactly the x rows that got no partner from a, rows that carry data only for the leading table, x.
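As a quick illustration of why = never matches NULLs (runs in any PostgreSQL session):

SELECT NULL = NULL;   -- -> NULL (unknown), not true
SELECT 1 = NULL;      -- -> NULL as well
SELECT NULL IS NULL;  -- -> true; IS NULL is the test that "sees" NULLs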
And there lies the problem.
Query a is filtered more aggressively: its join matches every a row with the right identifier, so the filter a.identifier IS NULL throws away every x row that has any match at all in a. For b and c, you filter less aggressively: the extra ON condition a.permit_id IS NULL rejects candidate rows whose permit_id is set, so x rows whose only matches in a carry a non-NULL permit_id end up unmatched too, and are counted on top of query a's rows. That is why b and c return 111981 instead of 109242. And in query c, the additional WHERE a.permit_id IS NULL changes nothing: every row that passes a.identifier IS NULL is an unmatched row, in which a.permit_id is NULL anyway, so c must return exactly the same count as b.
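To answer the request for a simpler example, here is a minimal sketch that reproduces the effect (two throwaway tables; the names mirror the generalised ones in the question):

CREATE TABLE x (info int);
CREATE TABLE a (identifier int, permit_id int);

INSERT INTO x VALUES (1), (2);
-- One matching row in a, but with a non-NULL permit_id:
INSERT INTO a VALUES (1, 42);

-- Variant a: the join matches x.info = 1, and the WHERE
-- then excludes that row -> count = 1
SELECT COUNT(*)
FROM x
LEFT JOIN a ON x.info = a.identifier
WHERE a.permit_id IS NULL AND a.identifier IS NULL;

-- Variants b and c: the extra ON condition rejects the (1, 42) row,
-- so x.info = 1 is left unmatched and counted as well -> count = 2
SELECT COUNT(*)
FROM x
LEFT JOIN a ON x.info = a.identifier AND a.permit_id IS NULL
WHERE a.identifier IS NULL;

Adding WHERE a.permit_id IS NULL to the second query (variant c) still returns 2, since the surviving rows are unmatched and have a.permit_id NULL anyway.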