Search code examples
mysqlsqlgreatest-n-per-group

Return top N rows per group in MySQL, but efficiently


I have a pretty simple table in MySQL 5.7.30, which I boiled down to the three columns below. I'm trying to determine top N elements per group for some groups (WHERE groupable IN (3, 4, 5)). But I cannot do it efficiently even for a single group (see WHERE groupable = 3 below).

DROP TABLE IF EXISTS test;
CREATE TABLE test (
    id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    groupable BIGINT NOT NULL,
    orderable BIGINT NOT NULL,
    UNIQUE INDEX test_index_1 (groupable, orderable),
    UNIQUE INDEX test_index_2 (orderable, groupable),
    INDEX test_index_3 (orderable),
    INDEX test_index_4 (groupable)
);
INSERT INTO test(groupable, orderable) VALUES
    (1, 100), (1, 101), (1, 102), (1, 103), (1, 104), (1, 105), (1, 106), (1, 107),
    (2, 200), (2, 201), (2, 202), (2, 203), (2, 204), (2, 205), (2, 206), (2, 207),
    (3, 300), (3, 301), (3, 302), (3, 303), (3, 304), (3, 305), (3, 306), (3, 307),
    (4, 400);


EXPLAIN SELECT id FROM test
WHERE groupable = 3
ORDER BY orderable LIMIT 2;

The final EXPLAIN returns the rows value of 8. According to the documentation, "the rows column indicates the number of rows MySQL believes it must examine to execute the query." I was hoping that having a (groupable, orderable) index would alleviate the need to examine every row with groupable = 3 and allow the engine to access the largest ones directly. Is that not the case? Is there a way around that?

I see people ask this question all the time, but all the answers I've seen so far seem to have the same downside: examining every row per group. Or for those that don't have a WHERE/IN clause, examining the whole table.

Thanks for your help!

Note: while this example is small, I've reproduced the same on a table with thousands of groupables and hundreds of rows for each groupable.

Note #2: I've added extra indexes just in case, to make sure I'm not missing some hidden optimisation.


Solution

  • The composite index that includes the grouping and ordering column will fully cover this query. Additionally, mysql will stop reading the index as soon as it finds the number of results specified in the LIMIT.

    In this way, the query will not examine all the rows when it actually runs. The EXPLAIN clause is an approximation and does not include this short-circuit LIMIT optimization in its estimation for ROWS examined.

    From the docs... https://dev.mysql.com/doc/refman/5.7/en/limit-optimization.html

    MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast

    https://dev.mysql.com/doc/refman/5.7/en/explain-output.html

    Using index - The column information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row. This strategy can be used when the query uses only columns that are part of a single index.