The database schema below represents a fictional shows database called hbtn_0d_tvshows
tables.
tv_shows |
---|
id |
title |
tv_genres |
---|
id |
name |
tv_show_genres |
---|
id |
show_id |
genre_id |
The tv_show_genres
is a pivot table that links tv_shows and tv_genres tables. Now I have been given a task to retrieve rows without the comedy genre in them. A friend came up with the below query and it works.
SELECT DISTINCT `title`
FROM `tv_shows` AS t
LEFT JOIN `tv_show_genres` AS s
ON s.`show_id` = t.`id`
LEFT JOIN `tv_genres` AS g
ON g.`id` = s.`genre_id`
WHERE t.`title` NOT IN
(SELECT `title`
FROM `tv_shows` AS t
INNER JOIN `tv_show_genres` AS s
ON s.`show_id` = t.`id`
INNER JOIN `tv_genres` AS g
ON g.`id` = s.`genre_id`
WHERE g.`name` = "Comedy")
ORDER BY `title`;
My question is, why do we need the subquery when we can put a where clause like this where g.name != "Comedy"
in the outer query. And since all tables are already loaded by the joins, the filter should work. Like below SQL:
SELECT DISTINCT `title`
FROM `tv_shows` AS t
LEFT JOIN `tv_show_genres` AS s
ON s.`show_id` = t.`id`
LEFT JOIN `tv_genres` AS g
ON g.`id` = s.`genre_id`
WHERE g.`name` != "Comedy";
May be I am missing something, but this is a cause of confusion for me and a stumble in my SQL learning.
Expected results
title
Better Call Saul
Breaking Bad
Dexter
Game of Thrones
Homeland
House
Let's start with your second query, which does not provide the desired result:
SELECT DISTINCT `title`
FROM `tv_shows` AS t
LEFT JOIN `tv_show_genres` AS s
ON s.`show_id` = t.`id`
LEFT JOIN `tv_genres` AS g
ON g.`id` = s.`genre_id`
WHERE g.`name` != "Comedy";
Firstly, don't use "Comedy"
(double quotes) for string literals. It works (with default MySQL config) but is non-standard and should be avoided. Use 'Comedy'
(single quotes) instead.
Secondly, because your WHERE
clause is applied to the rightmost table (tv_genres.name
), your LEFT
joins are implicitly changed to INNER
joins.
As already explained by nbk in the comments, your WHERE
clause is just including all rows where g.name != 'Comedy'
, which is not the same as "TV shows that do not have the comedy genre".
Now, let's jump back to your first query:
SELECT DISTINCT `title`
FROM `tv_shows` AS t
LEFT JOIN `tv_show_genres` AS s
ON s.`show_id` = t.`id`
LEFT JOIN `tv_genres` AS g
ON g.`id` = s.`genre_id`
WHERE t.`title` NOT IN
(SELECT `title`
FROM `tv_shows` AS t
INNER JOIN `tv_show_genres` AS s
ON s.`show_id` = t.`id`
INNER JOIN `tv_genres` AS g
ON g.`id` = s.`genre_id`
WHERE g.`name` = "Comedy"'Comedy')
ORDER BY `title`;
The two LEFT
joins in the outer query are doing nothing, so we can get rid of them. This also means DISTINCT
is no longer needed. Instead of joining (for t.`title` NOT IN
) on the title
, we should be using the id
, which has the added bonus of not needing to join tv_shows
in the subquery:
SELECT `title`
FROM `tv_shows` AS t
WHERE t.`id` NOT IN (
SELECT `show_id`
FROM `tv_show_genres` AS s
INNER JOIN `tv_genres` AS g
ON g.`id` = s.`genre_id`
WHERE g.`name` = 'Comedy'
)
ORDER BY `title`;
This can be rewritten as a LEFT JOIN ... IS NULL
:
SELECT `title`
FROM `tv_shows` AS t
LEFT JOIN `tv_show_genres` AS s
ON s.`show_id` = t.`id`
AND s.`genre_id` = (SELECT id FROM `tv_genres` WHERE `name` = 'Comedy')
WHERE s.`genre_id` IS NULL
ORDER BY `title`;
or NOT EXISTS
:
SELECT *
FROM tv_shows s
WHERE NOT EXISTS (
SELECT 1
FROM tv_show_genres sg
JOIN tv_genres g ON sg.genre_id = g.id
WHERE sg.show_id = s.id
AND g.name = 'Comedy'
)
ORDER BY title;
Here's a db<>fiddle.
Note that I have changed the indexing of tv_show_genres
in the db<>fiddle. With few exceptions, pivot tables need composite indices in both directions - PK(show_id, genre_id)
and secondary (genre_id, show_id)
.