Search code examples
sqlmysqljoin

Understanding joining of three tables in MySQL


The database schema below represents a fictional shows database called hbtn_0d_tvshows tables.

tv_shows
id
title
tv_genres
id
name
tv_show_genres
id
show_id
genre_id

The tv_show_genres is a pivot table that links tv_shows and tv_genres tables. Now I have been given a task to retrieve rows without the comedy genre in them. A friend came up with the below query and it works.

SELECT DISTINCT `title`
  FROM `tv_shows` AS t
       LEFT JOIN `tv_show_genres` AS s
       ON s.`show_id` = t.`id`

       LEFT JOIN `tv_genres` AS g
       ON g.`id` = s.`genre_id`
       WHERE t.`title` NOT IN
             (SELECT `title`
                FROM `tv_shows` AS t
                 INNER JOIN `tv_show_genres` AS s
             ON s.`show_id` = t.`id`

             INNER JOIN `tv_genres` AS g
             ON g.`id` = s.`genre_id`
             WHERE g.`name` = "Comedy")
 ORDER BY `title`;

My question is, why do we need the subquery when we can put a where clause like this where g.name != "Comedy" in the outer query. And since all tables are already loaded by the joins, the filter should work. Like below SQL:

SELECT DISTINCT `title`
      FROM `tv_shows` AS t
           LEFT JOIN `tv_show_genres` AS s
           ON s.`show_id` = t.`id`

           LEFT JOIN `tv_genres` AS g
           ON g.`id` = s.`genre_id`
           WHERE g.`name` != "Comedy";

May be I am missing something, but this is a cause of confusion for me and a stumble in my SQL learning.

Sample database with data

Expected results

title
Better Call Saul
Breaking Bad
Dexter
Game of Thrones
Homeland
House

Solution

  • Let's start with your second query, which does not provide the desired result:

    SELECT DISTINCT `title`
    FROM `tv_shows` AS t
    LEFT JOIN `tv_show_genres` AS s
        ON s.`show_id` = t.`id`
    LEFT JOIN `tv_genres` AS g
        ON g.`id` = s.`genre_id`
    WHERE g.`name` != "Comedy";
    

    Firstly, don't use "Comedy" (double quotes) for string literals. It works (with default MySQL config) but is non-standard and should be avoided. Use 'Comedy' (single quotes) instead.

    Secondly, because your WHERE clause is applied to the rightmost table (tv_genres.name), your LEFT joins are implicitly changed to INNER joins.

    As already explained by nbk in the comments, your WHERE clause is just including all rows where g.name != 'Comedy', which is not the same as "TV shows that do not have the comedy genre".

    Now, let's jump back to your first query:

    
    SELECT DISTINCT `title`
      FROM `tv_shows` AS t
           LEFT JOIN `tv_show_genres` AS s
           ON s.`show_id` = t.`id`
    
           LEFT JOIN `tv_genres` AS g
           ON g.`id` = s.`genre_id`
           WHERE t.`title` NOT IN
                 (SELECT `title`
                    FROM `tv_shows` AS t
                     INNER JOIN `tv_show_genres` AS s
                 ON s.`show_id` = t.`id`
    
                 INNER JOIN `tv_genres` AS g
                 ON g.`id` = s.`genre_id`
                 WHERE g.`name` = "Comedy"'Comedy')
     ORDER BY `title`;
    

    The two LEFT joins in the outer query are doing nothing, so we can get rid of them. This also means DISTINCT is no longer needed. Instead of joining (for t.`title` NOT IN) on the title, we should be using the id, which has the added bonus of not needing to join tv_shows in the subquery:

    SELECT `title`
    FROM `tv_shows` AS t
    WHERE t.`id` NOT IN (
        SELECT `show_id`
        FROM `tv_show_genres` AS s
        INNER JOIN `tv_genres` AS g
            ON g.`id` = s.`genre_id`
        WHERE g.`name` = 'Comedy'
    )
     ORDER BY `title`;
    

    This can be rewritten as a LEFT JOIN ... IS NULL:

    SELECT `title`
    FROM `tv_shows` AS t
    LEFT JOIN `tv_show_genres` AS s
        ON s.`show_id` = t.`id`
        AND s.`genre_id` = (SELECT id FROM `tv_genres` WHERE `name` = 'Comedy')
    WHERE s.`genre_id` IS NULL
    ORDER BY `title`;
    

    or NOT EXISTS:

    SELECT *
    FROM tv_shows s
    WHERE NOT EXISTS (
        SELECT 1
        FROM tv_show_genres sg
        JOIN tv_genres g ON sg.genre_id = g.id
        WHERE sg.show_id = s.id
        AND g.name = 'Comedy'
    )
    ORDER BY title;
    

    Here's a db<>fiddle.

    Note that I have changed the indexing of tv_show_genres in the db<>fiddle. With few exceptions, pivot tables need composite indices in both directions - PK(show_id, genre_id) and secondary (genre_id, show_id).