mysql sql subquery join correlated-subquery

SQL to find next 5 most lengthy films in each category whose length is greater than the avg of length in their respective category?

From the sakila database available here, I want to find next 5 lengthiest movies in each category. I made this problem. Using the query

SELECT fl.title, fl.length, ct.name
FROM film fl JOIN film_category fc ON fl.film_id=fc.film_id
             JOIN category ct ON ct.category_id=fc.category_id
WHERE fl.length > (SELECT AVG(fl.length) 
                   FROM film fl JOIN film_category fc ON fl.film_id = fc.film_id
                   JOIN category cat ON cat.category_id=fc.category_id
                   WHERE cat.name=ct.name);

I can find all the Films whose length is greater than the average of film in their respective categories. When I add LIMIT 5, then only 5 rows that are retrieved - I need to get 5 from each category. Adding GROUP BY ct.name only gives 16 rows - one from each category.

What query would get this done ?
Would it be possible to achieve this without using any correlated subquery ?
What can help me identify if result of correlated subquery be achieved by joining tables ?

I think it is similar to this, but I couldn't figure out a solution.

Thanks!

Solution

The following has been tested (successfully) against the Sakila database...

SELECT title,
       length,
       categoryName
FROM ( SELECT CASE
                  WHEN film_category.category_id = @currentCategory THEN
                      @currentRecord := @currentRecord + 1
                  ELSE
                      @currentRecord := 1 AND
                      @currentCategory := film_category.category_id
              END AS recordNumber,
              film.title AS title,
              film.length AS length,
              categoryName AS categoryName
       FROM film
       JOIN film_category ON film.film_id = film_category.film_id
       JOIN ( SELECT film_category.category_id AS category_id,
                     category.name AS categoryName,
                     AVG( film.length ) AS categoryAvgLength
              FROM film
              JOIN film_category ON film.film_id = film_category.film_id
              JOIN category ON category.category_id = film_category.category_id
              GROUP BY film_category.category_id,
                       category.name
            ) AS categoryAvgLengthFinder ON film_category.category_id = categoryAvgLengthFinder.category_id,
            (
                SELECT @currentCategory := 0 AS currentCategory
            ) AS currentValuesInitialiser
       WHERE film.length > categoryAvgLength
       ORDER BY film_category.category_id,
                film.length
     ) AS allLarger
WHERE recordNumber <= 5;

This statement starts by using the following subquery to form a "table" consisting of each category's unique identifier category_id, the name of the category and the corresponding average Film length...

SELECT film_category.category_id AS category_id,
       category.name AS categoryName,
       AVG( film.length ) AS categoryAvgLength
FROM film
JOIN film_category ON film.film_id = film_category.film_id
JOIN category ON category.category_id = film_category.category_id
GROUP BY film_category.category_id,
       category.name

This subquery's results are then joined to film and film_catgory. As the subquery retrieves all the details from category that we will need for the rest of the statement, no JOIN with category is needed.

The resulting dataset is then cross-joined with SELECT @currentCategory := 0 AS currentCategory to initialise the variable @currentCategory within the same statement. This does come at the cost of appending a field called currentCategory to the dataset generated above, so you may prefer to use the following code instead...

SET @currentCategory := 0;
SELECT title,
       length,
       categoryName
FROM ( SELECT CASE
                  WHEN film_category.category_id = @currentCategory THEN
                      @currentRecord := @currentRecord + 1
                  ELSE
                      @currentRecord := 1 AND
                      @currentCategory := film_category.category_id
              END AS recordNumber,
              film.title AS title,
              film.length AS length,
              categoryName AS categoryName
       FROM film
       JOIN film_category ON film.film_id = film_category.film_id
       JOIN ( SELECT film_category.category_id AS category_id,
                     category.name AS categoryName,
                     AVG( film.length ) AS categoryAvgLength
              FROM film
              JOIN film_category ON film.film_id = film_category.film_id
              JOIN category ON category.category_id = film_category.category_id
              GROUP BY film_category.category_id,
                       category.name
            ) AS categoryAvgLengthFinder ON film_category.category_id = categoryAvgLengthFinder.category_id
       WHERE film.length > categoryAvgLength
       ORDER BY film_category.category_id,
                film.length
     ) AS allLarger
WHERE recordNumber <= 5;

Once the JOIN's are performed (and @currentCategory is initialised), the resulting dataset is refined to just those records whose value of film.length' is greater than the corresponding average for that category. The refined dataset is then sorted (not grouped) by one of thecategory_idfields (of which there will be two sharing the same value owing to the joining) and subsorted byfilm.length`.

When the fields are chosen, each record's value of category_id is compared to the value of @currentCategory. If they do not match then @currentRecord is initialised to 1 and @currentCategory is updated to the new value of category_id. If they do match, then @currentRecord is incremented. In either case, the value assigned to @currentRecord is returned to the SELECT statement into the field given the alias recordNumber. It is thus that we are able to prepend a record number to our refined dataset.

Then all that remains is to SELECT all the records from the refined dataset (sans record number) where the record number is less than or equal to 5.

Please note that the above achieve their results without the use of any correlated subqueries, instead using JOIN's to tables and subqueries.

If you have any questions or comments, then please feel free to post a Comment accordingly.