Search code examples
mysqlimdb

Querying from IMDB Database using MySQL


I wrote a SQL query to answer the following question:

Find all the actors that made more movies with Yash Chopra than any other director in the IMBD database.

Sample schema:

person
(pid *
,name
);

m_cast
(mid *
,pid *
);

m_director
(mid*
,pid*
);

* = (component of) PRIMARY KEY

Following is my query:

WITH common_actors AS 
        (SELECT A.actor_id as actors, B.director_id as director_id, B.movies as movies_with_director,
        B.director_id as yash_chops_id, B.movies as movies_with_yash_chops FROM
        (SELECT M_Cast.PID as actor_id, M_Director.PID as director_id, COUNT(*) as movies from M_Cast
        left join M_Director  
        ON M_Cast.MID = M_Director.MID
        GROUP BY actor_id, director_id) A
        JOIN
        (SELECT M_Cast.PID as actor_id, M_Director.PID as director_id, COUNT(*) as movies from M_Cast
        left join M_Director  
        ON M_Cast.MID = M_Director.MID
        GROUP BY actor_id, director_id
        )B
        ON A.actor_id = B.actor_id
        WHERE B.director_id in (SELECT PID FROM Person WHERE Name LIKE 
        '%Yash%Chopra%'))

SELECT distinct actors as actor_id, movies_with_yash_chops as total_movies FROM common_actors
    WHERE actors NOT IN (SELECT actors FROM common_actors WHERE movies_with_director > movies_with_yash_chops)

And the result obtained from this is of length: 430 rows. However the result obtained should be of length 243 rows. Could anyone please suggest where I went wrong in my query? Is my approach right?

Sample result:

    Actor name
  0 Sharib Hashmi
  1 Kulbir Badesron
  2 Gurdas Maan
  3 Parikshat Sahni
...
242 Ramlal Shyamlal

Thanks in advance!


Solution

  • Consider the following:

    DROP TABLE IF EXISTS person;
    
    CREATE TABLE person
    (person_id SERIAL PRIMARY KEY
    ,name VARCHAR(20) NOT NULL UNIQUE
    );
    
    DROP TABLE IF EXISTS movie;
    
    CREATE TABLE movie
    (movie_id SERIAL PRIMARY KEY
    ,title VARCHAR(50) NOT NULL UNIQUE
    );
    
    DROP TABLE IF EXISTS m_cast;
    
    CREATE TABLE m_cast
    (movie_id INT NOT NULL
    ,person_id INT NOT NULL
    ,PRIMARY KEY(movie_id,person_id)
    );
    
    DROP TABLE IF EXISTS m_director;
    
    CREATE TABLE m_director
    (movie_id INT NOT NULL
    ,person_id INT NOT NULL
    ,PRIMARY KEY(movie_id,person_id)
    );
    
    INSERT INTO person (name) VALUES
    ('Steven Feelberg'),
    ('Manly Kubrick'),
    ('Alfred Spatchcock'),
    ('Fred Pitt'),
    ('Raphael DiMaggio'),
    ('Bill Smith');
    
    INSERT INTO movie VALUES
    (1,'Feelberg\'s Movie with Fred & Raph'),
    (2,'Feelberg and Fred Ride Again'),
    (3,'Kubrick shoots DiMaggio'),
    (4,'Kubrick\'s Movie with Bill Smith'),
    (5,'Spatchcock Presents Bill Smith');
    
    INSERT INTO m_director VALUES
    (1,1),
    (2,1),
    (3,2),
    (4,2),
    (5,3);
    
    INSERT INTO m_cast VALUES
    (1,4),
    (1,5),
    (2,4),
    (3,5),
    (4,6),
    (5,6);
    

    I've included the movie table only for ease of reference. It's not relevant to the actual problem. Also, note that this model assumes that cast members are only listed once, regardless of whether or not they have multiple roles in a given film.

    The following query asks 'how often have each actor and director worked together'...

    An actor is any person who has been a cast member of any movie. A director is any person who has been a director of any movie.

    SELECT a.name actor
         , d.name director
         , COUNT(DISTINCT ma.movie_id) total
      FROM person d
      JOIN m_director md 
        ON md.person_id = d.person_id
      JOIN person a
      LEFT
      JOIN m_cast ma 
        ON ma.person_id = a.person_id
       AND ma.movie_id = md.movie_id
      JOIN m_cast x
        ON x.person_id = a.person_id
     GROUP
        BY actor
         , director;
         
    +-------------------+-------------------+-------+
    | actor             | director          | total |
    +-------------------+-------------------+-------+
    | Fred Pitt         | Alfred Spatchcock |     0 |
    | Fred Pitt         | Manly Kubrick     |     0 |
    | Fred Pitt         | Steven Feelberg   |     2 |
    | Raphael DiMaggio  | Alfred Spatchcock |     0 |
    | Raphael DiMaggio  | Manly Kubrick     |     1 |
    | Raphael DiMaggio  | Steven Feelberg   |     1 |
    | Bill Smith        | Alfred Spatchcock |     1 |
    | Bill Smith        | Manly Kubrick     |     1 |
    | Bill Smith        | Steven Feelberg   |     0 |
    +-------------------+-------------------+-------+
    

    By observation, we can see that:

    • the only actor to work more often with Feelberg than any other director is Fred Pritt
    • Raphael DiCaprio and Bill Smith have both worked equally often with two directors (albeit different directors)

    EDIT: While I'm not seriously advocating this as a solution, the following is simply to demonstrate that the kernel provided above is really all you need to solve the problem...

    SELECT x.*
      FROM 
         ( SELECT a.* 
     FROM 
        ( SELECT a.name actor
               , d.name director
               , COUNT(DISTINCT ma.movie_id) total
            FROM person d
            JOIN m_director md 
              ON md.person_id = d.person_id
            JOIN person a
            LEFT
            JOIN m_cast ma 
              ON ma.person_id = a.person_id
             AND ma.movie_id = md.movie_id
            JOIN m_cast x
              ON x.person_id = a.person_id
           GROUP
              BY actor
               , director
        ) a
     LEFT
     JOIN
        ( SELECT a.name actor
               , d.name director
               , COUNT(DISTINCT ma.movie_id) total
            FROM person d
            JOIN m_director md 
              ON md.person_id = d.person_id
            JOIN person a
            LEFT
            JOIN m_cast ma 
              ON ma.person_id = a.person_id
             AND ma.movie_id = md.movie_id
            JOIN m_cast x
              ON x.person_id = a.person_id
           GROUP
              BY actor
               , director
        ) b
       ON b.actor = a.actor
      AND b.director <> a.director 
      AND b.total > a.total
    WHERE b.actor IS NULL
    ) x
    LEFT JOIN
         ( SELECT a.* 
     FROM 
        ( SELECT a.name actor
               , d.name director
               , COUNT(DISTINCT ma.movie_id) total
            FROM person d
            JOIN m_director md 
              ON md.person_id = d.person_id
            JOIN person a
            LEFT
            JOIN m_cast ma 
              ON ma.person_id = a.person_id
             AND ma.movie_id = md.movie_id
            JOIN m_cast x
              ON x.person_id = a.person_id
           GROUP
              BY actor
               , director
        ) a
     LEFT
     JOIN
        ( SELECT a.name actor
               , d.name director
               , COUNT(DISTINCT ma.movie_id) total
            FROM person d
            JOIN m_director md 
              ON md.person_id = d.person_id
            JOIN person a
            LEFT
            JOIN m_cast ma 
              ON ma.person_id = a.person_id
             AND ma.movie_id = md.movie_id
            JOIN m_cast x
              ON x.person_id = a.person_id
           GROUP
              BY actor
               , director
        ) b
       ON b.actor = a.actor
      AND b.director <> a.director 
      AND b.total > a.total
    WHERE b.actor IS NULL
    ) y
    ON y.actor = x.actor AND y.director <> x.director
    WHERE y.actor IS NULL;
    
    +-----------+-----------------+-------+
    | actor     | director        | total |
    +-----------+-----------------+-------+
    | Fred Pitt | Steven Feelberg |     2 |
    +-----------+-----------------+-------+
    

    This returns a list of every actor, and the director with whom they've worked most often. In this case, because Bill Smith and Raphael DiMaggio have worked most often equally with two directors, they are excluded from the result.

    The answer to your problem is simply to select from this list all rows with Yash Chopra listed as the director.