Search code examples
sqlpostgresqlaggregate-functionswindow-functions

Find movies with highest number of awards in certain year - code duplication


I am trying to write a query (PostgreSQL) to get "Movies with highest number of awards in year 2012."

I have following tables:

CREATE TABLE Award(
    ID_AWARD bigserial CONSTRAINT Award_pk PRIMARY KEY,
    award_name VARCHAR(90),
    category VARCHAR(90),
    award_year integer,
    CONSTRAINT award_unique UNIQUE (award_name, category, award_year));

CREATE TABLE AwardWinner(
    ID_AWARD integer,
    ID_ACTOR integer,
    ID_MOVIE integer,
    CONSTRAINT AwardWinner_pk PRIMARY KEY (ID_AWARD));

And I written following query, which gives correct results, but there's is quite a lot of code duplication I think.

select * from 
(select id_movie, count(id_movie) as awards 
from Award natural join awardwinner 
where award_year = 2012 group by id_movie) as SUB
where awards = (select max(count) from 
(select id_movie, count(id_movie) 
from Award natural join awardwinner 
where award_year = 2012 group by id_movie) as SUB2);

So SUB and SUB2 are exactly the same subquery. Is there a better way to do this?


Solution

  • Get all winning movies

    SELECT id_movie, awards
    FROM  (
       SELECT aw.id_movie, count(*) AS awards
             ,rank() OVER (ORDER BY count(aw.id_movie) DESC) AS rnk
       FROM   award       a
       JOIN   awardwinner aw USING (id_award)
       WHERE  a.award_year = 2012
       GROUP  BY aw.id_movie
       ) sub
    WHERE  rnk = 1;
    

    Major points

    • This should be simpler and faster than suggestions so far. Test with EXPLAIN ANALYZE.

    • There are cases where CTEs are instrumental to avoid code duplication. But not in this time: a subquery does the job just fine and is usually faster.

    • You can run a window function OVER an aggregate function on the same query level. That's why this works:

      rank() OVER (ORDER BY count(aw.id_movie) DESC) AS rnk
      
    • I'd suggest to use explicit column names in the JOIN condition instead of NATURAL JOIN, which is prone to breakage if you later change / add columns to the underlying tables.
      The JOIN condition with USING is almost as short, but doesn't break as easily.

    • Since id_movie cannot be NULL (ruled out by the JOIN condition and also part of the pk) it is shorter ans slightly faster to use count(*) instead. Same result.

    Just one movie

    Shorter and faster, yet, if you only need one winner:

    SELECT aw.id_movie, count(*) AS awards
    FROM   award       a
    JOIN   awardwinner aw USING (id_award)
    WHERE  a.award_year = 2012
    GROUP  BY 1
    ORDER  BY 2 DESC, 1 -- as tie breaker
    LIMIT  1
    

    Using positional references (1, 2) here as shorthand.
    I added id_movie to ORDER BY as tie breaker in case multiple movies should qualify for the win.