Tags: mysql, group-by, query-optimization

Is there a way to hint MySQL to use "Using index for group-by"?


I was busying myself with exploring GROUP BY optimizations, using the classic "max salary per department" query, and suddenly got weird results. The dump below comes straight from my console. No commands were issued between these two EXPLAINs; only some time had passed.

mysql> explain select name, t1.dep_id, salary 
       from emploee t1
       JOIN ( select dep_id, max(salary) msal 
              from emploee 
              group by dep_id
       ) t2
       ON t1.salary=t2.msal and t1.dep_id = t2.dep_id 
       order by salary desc;
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| id | select_type | table      | type  | possible_keys | key    | key_len | ref               | rows | Extra                           |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL   | NULL    | NULL              |    4 | Using temporary; Using filesort |
|  1 | PRIMARY     | t1         | ref   | dep_id        | dep_id | 8       | t2.dep_id,t2.msal |    1 |                                 |
|  2 | DERIVED     | emploee    | index | NULL          | dep_id | 8       | NULL              |   84 | Using index                     |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
3 rows in set (0.00 sec)

mysql> explain select name, t1.dep_id, salary 
       from emploee t1 
       JOIN (  select dep_id, max(salary) msal 
               from emploee 
               group by dep_id
       ) t2
       ON t1.salary=t2.msal and t1.dep_id = t2.dep_id 
       order by salary desc;
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| id | select_type | table      | type  | possible_keys | key    | key_len | ref               | rows | Extra                           |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL   | NULL    | NULL              |    4 | Using temporary; Using filesort |
|  1 | PRIMARY     | t1         | ref   | dep_id        | dep_id | 8       | t2.dep_id,t2.msal |    3 |                                 |
|  2 | DERIVED     | emploee    | range | NULL          | dep_id | 4       | NULL              |    9 | Using index for group-by        |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
3 rows in set (0.00 sec)

As you may notice, the second run examined a tenth as many rows. I assume this is because some internal counters changed, but I don't want to depend on those counters. So: is there a way to hint MySQL to always use the "Using index for group-by" behavior?

Or, if my speculation is wrong, is there any other explanation for this behavior, and how can I fix it?

CREATE TABLE `emploee` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `dep_id` int(11) NOT NULL,
  `salary` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `dep_id` (`dep_id`,`salary`)
) ENGINE=InnoDB AUTO_INCREMENT=85 DEFAULT CHARSET=latin1

+-----------+
| version() |
+-----------+
| 5.5.19    |
+-----------+

Solution

  • Hm, showing the cardinality of the indexes may help, but keep in mind: a range scan is usually slower than a full index scan here.

    Because it thinks it can match the full index in the first run, it uses the full index. In the second run it drops the full index scan and goes for a range, but it estimates the number of rows satisfying that larger range as wildly lower than for the smaller full index, because the cardinality statistics have changed. Compare it to this: why would "AA" match 84 rows, but "A[any character]" match only 9? (Note that it uses 8 bytes of the key in the first run and 4 bytes in the second.) The second plan will not actually read fewer rows; EXPLAIN just estimates the row count differently after an update of its index metadata. Note also that EXPLAIN does not tell you what a query will do, only what it probably will do.

    Updating the cardinality can or will occur when:

    The cardinality (the number of different key values) in every index of a table is recalculated when the table is opened, when SHOW TABLE STATUS or ANALYZE TABLE is run, and under other circumstances (such as when the table has changed too much). Note that all tables are opened, and the statistics re-estimated, when the mysql client starts with the auto-rehash setting on (the default).

    So, assume 'at any point' due to 'changed too much', and yes, connecting with the mysql client can alter the server's choice of indexes. Also: the mysql client reconnecting after its connection was lost to a timeout counts as connecting with auto-rehash, AFAIK. If you want to help MySQL find the proper method, run ANALYZE TABLE once in a while, especially after heavy updating. If you think the cardinality it estimates is often wrong, you can raise the number of index pages it reads to build those statistics, but keep in mind that a higher number means a longer-running cardinality update, and that update is something you don't want happening too often on a table with a lot of write activity (it is triggered whenever the 'data has changed too much').
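    A minimal sketch of that maintenance routine, assuming a MySQL 5.5 server as in the question and the SUPER privilege for the GLOBAL change (`innodb_stats_sample_pages` is the 5.5-era variable controlling how many index pages are sampled; its default is 8):

    ```sql
    -- Re-estimate index cardinality deliberately, e.g. after heavy batch
    -- updates, instead of waiting for MySQL's own "changed too much" trigger.
    ANALYZE TABLE emploee;

    -- If the estimates keep bouncing, sample more index pages per update.
    -- More pages give steadier statistics at the cost of a slower ANALYZE.
    SET GLOBAL innodb_stats_sample_pages = 32;
    ANALYZE TABLE emploee;
    ```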

    TL;DR: it estimates the rows differently, but you would actually prefer the first plan's behavior if the data makes it possible.

    Adding: on the previously linked page we can probably also find why dep_id in particular might have this problem:

    small values like 1 or 2 can result in very inaccurate estimates of cardinality

    I can imagine the number of distinct dep_id values is typically quite small, and I have indeed observed such 'bouncing' cardinality on non-unique indexes whose value range is small compared to the number of rows in my own databases. It easily estimates a 1-10 value range in the hundreds one time and back down again the next, based purely on the specific sample pages it picks and the algorithm that tries to extrapolate from them.
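    To watch this happen on your own table, compare the optimizer's estimate before and after a statistics update; the Cardinality column of SHOW INDEX is the estimate EXPLAIN relies on:

    ```sql
    -- Cardinality here is an estimate derived from sampled index pages,
    -- not an exact count; for a low-cardinality column like dep_id it can
    -- change noticeably between statistics updates.
    SHOW INDEX FROM emploee;

    -- Force a statistics refresh, then compare the Cardinality column again.
    ANALYZE TABLE emploee;
    SHOW INDEX FROM emploee;
    ```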