I have some funny bug with Postgres 9.5.1
I got 2 tables that contain related data contacts (id, name)
and jobs (id, contact_id, name)
.
I'm not sure of the validity of this query (given the curious behavior explained just after).
-- get unassigned contacts
select * from contacts where id not in (select contact_id from jobs);
Edit : The following case was how i tried to analyze the issue. See the end of post and comments to get why the query is not correct.
When testing with a case of contact id=20 without job, I got some (IMO) strange result (a notable difference in results between a select query and a join equivalent).
First, I need to assert some prerequisites (step A). Next, I show the result with join (step B). Finally, I show the result using subquery (step D). (Step C is the complementary request of D and is only here to highlight what I found strange).
select count(distinct id) from contacts;
--> returns 10100
select count(distinct id) from jobs;
--> returns 12000
select count(distinct id) from contacts where id in (select contact_id from jobs);
--> returns 10000
select name from contacts where id=20;
--> returns "NAME"
select id from jobs where contact_id=20;
--> returns nothing (0 row)
select c.id, c.name, j.id
from contacts c
left join jobs j
on j.contact_id=c.id
where c.id=20;
--> returns 20, "NAME", <NULL>
select name from contacts where id in (select contact_id from jobs) and id=20;
--> returns nothing (0 row); (that's the expected result)
select name from contacts where id not in (select contact_id from jobs) and id=20;
--> returns nothing (0 row); (that's not the expected result - "NAME")
C and D queries got the same results.
In logical terms, this could means that in pgsql:
id NOT IN (..values..) == id IN (..values..)
FALSE == TRUE
Can a "Postgres guru" find me a nice explanation or should I call the FBI?
following the answers
My query
select * from contacts where id not in (select contact_id from jobs);
was not correct, because NOT IN
can't handle NULL values. Therefore, it is not the right selector to check (non)existence of a value.
See NULL values inside NOT IN clause.
The correct query is the following :
-- to get unassigned contacts
select * from contacts c where not exists (select 1 from jobs where contact_id=c.id);
For a specified id :
select * from contacts c where not exists (select 1 from jobs where contact_id=c.id) and id=20;
This query works too :
select * from contacts where id not in (select contact_id from jobs where contact_id is not null);
What you are seeing is a null
-safety problem. If any value returned by the not in
subquery is null
, all other values are ignored. We say that not in
is not null-safe.
Imagine that the subquery returns: (1, 2, null)
. The not in
condition becomes:
id <> 1 and id <> 2 and id <> null
The first two conditions evaluate as true, but the last one is unknown
, which contaminates the whole predicate, that, in turns, returns unknown
. As a consequence, all rows are evicted.
This is one of the reason why the use of not in
is usually discouraged. You can simply rewrite this with not exists
:
select name
from contacts c
where c.id = 20 and not exists(select 1 from jobs j where j.contact_id = c.id);