Search code examples
sqlsqliteimdb

Find all the actors that made more movies with Yash Chopra than any other director using IMDB dataset (Reposted)


The first query q1 gives desired result. But query q2 gives excessive large no of movies for a pair of actor and director.

SELECT * FROM 
   (SELECT  pc.PID AS actorID, pc.Name AS Actor, pd.PID AS directorID,pd.Name AS 
    Director,COUNT(DISTINCT m.MID) count_movie FROM Movie m 
    JOIN 
    M_Cast mc ON m.MID = Trim(mc.MID) 
   JOIN  
   M_Director md ON  m.MID = md.MID
   JOIN  
   Person pc ON Trim(mc.PID) = pc.PID
   JOIN 
   Person pd ON md.PID = pd.PID
   WHERE pd.Name LIKE '%Yash%' GROUP BY pc.Name) AS q1

   JOIN

   (SELECT  pc.PID AS actorID,pc.Name Actor,pd.PID AS directorID,pd.Name AS Director, COUNT(DISTINCT 
   m.MID) count_movie FROM Movie m 
   JOIN 
   M_Cast mc ON  m.MID = TRIM(mc.MID ) 
   JOIN  
   M_Director md ON m.MID = md.MID 
   JOIN 
   Person pc ON TRIM(mc.PID) = pc.PID 
   JOIN 
   Person pd  ON md.PID = pd.PID
   WHERE pd.Name NOT LIKE '%Yash%' GROUP BY pc.PID) AS q2
   ON q1.Actor = q2.Actor ORDER BY q1.count_movie DESC
        actorID     Actor          directorID   Director      count_movie  actorID      Actor      directorID    Director       count_movie
   0    nm0707271   Jagdish Raj    nm0007181    Yash Chopra   11           nm0707271    Jagdish Raj nm0474806   Gulshan Kumar   98
   1    nm0471443   Manmohan       nm0007181    Yash Chopra   10           nm0471443    Manmohan    nm0695153   T. Prakash Rao  39
   2    nm0407002   Iftekhar       nm0007181    Yash Chopra    9           nm0407002    Iftekhar    nm1065099   C.P. Dixit      93
   3    nm0534501   Madan Puri     nm0007181    Yash Chopra    8           nm0534501    Madan Puri  nm0619050   Hiren Nag       94

The movie_count on extreme right seems to be unreasonably high. It should be less than 11 for the first row if it has to match the answer.

Here i have not used the above WHERE condition just to show where my query is wrong. Even if i use the Where condition

where q1.count_movie > q2.count_movie

The result which i get is

        Name            count
    0   Sanjeev Kumar   3
    1   Sanjeev Kumar   3
    2   Sanjeev Kumar   3
    3   Sanjeev Kumar   3
    4   Ashok Kumar     2

The below output is the correct result. It will give 245 rows. I am showing only 6 rows.

                Name            count
       0        Jagdish Raj     11
       1        Manmohan        10
       2        Iftekhar        9
       3        Shashi Kapoor   7
       4        Rakhee Gulzar   5
       5        Waheeda Rehman  5

The tables given are

  • Movie(MID,title,year)
  • M_Cast(MID,PID)
  • Person(PID,Name,Gender)
  • M_Director(MID,PID)

I am new in sql and having a lot of trouble in debugging the code. Kindly help me to understand the logic.


Solution

  • The problem with your second part of the query is that you sum up all movies of all other directors instead of counting movies per director.

    I see two approaches for the task:

    1. Per actor compare Yash Chopra's movie count to all others. (Which would requiere WHERE yash_chopra_count > ANY (...), but SQLite doesn't feature the ANY keyword.) Or compare Yash Chopra's movie count to the maximum movie count of the others. (Which does work in SQLite.)
    2. Rank directors per actor by movie count and see whether the best ranked actor is Yash Chopra. (Which should be done with ROW_COUNT).

    Before we start, let's check, if we got everything right: The data model allows multiple directors per movie. But we can still just count. If an actor made three movies with Yash Chopra and three with I. S. Johar, we don't care whether these are six separate movies or maybe just three where Yash Chopra was I. S. Johar's assistent director. We see in my example for the actor there is another director with at least as many movies as with Yash Chopra, so we don't want to select this actor. The m:n relation is hence no hindrance. It doesn't make a difference.

    Now that we made sure we can simply count movies per actor and director let's see the two approaches:

    Compare Yash Chopra count with maximum other count

    As mentioned, we must aggregate other directors twice (once for the count per director, once for the maximum count), because SQLite lacks the ANY operator. This isn't a big deal though when we use a CTE for readability.

    with director_actor as
    (
      select
        c.pid as pid_actor,
        d.pid as pid_director,
        count(*) as movie_count
      from m_director d
      join m_cast c on c.mid = d.mid
      group by c.pid, d.pid
    )
    select pid_actor, movie_count
    from director_actor
    where pid_director = (select pid from person where name = 'Yash Chopra')
    and movie_count >
    (
      select coalesce(max(movie_count), 0)
      from director_actor other
      where other.pid_actor = director_actor.pid_actor
      and other.pid_director <> director_actor.pid_director
    );
    

    Rank directors per actor by movie count

    What steps must we undertake to get the actors in question?

    1. Count movies per actor and director.
    2. Rank the directors per actor. The director with most movies gets rank #1. If there is a tie, then we rank the other director better then Yash Chopra, so Yash Chopra only get's rank #1, if he has more movies than the other directors.
    3. Keep the actors where the best ranked director is Yash Chopra.

    For the ranking I'd use ROW_NUMBER.

    with ranked as
    (
      select
        c.pid as pid_actor,
        d.pid as pid_director,
        yc.pid as pid_yash_chopra,
        count(*) as movie_count,
        row_number() over (partition by c.pid order by count(*) desc, d.pid = yc.pid) as rn
      from m_director d
      join m_cast c on c.mid = d.mid
      left join person yc on yc.pid = d.pid and name = 'Yash Chopra'
      group by c.pid, d.pid, yc.pid
    )
    select pid_actor, movie_count
    from ranked
    where rn = 1 and pid_director = pid_yash_chopra;