Search code examples
mysqlquery-optimizationsql-execution-planexplain

Mysql bad execution plan


I'd like some help on understanding the data output by explain on very similar queries, and the enormous impact on performance. I have 2 tables : annonce and geolocalisation. The first one contains rental ads, and the second one the corresponding location. So, we search rentals in a given place. If i use the default plan

EXPLAIN
SELECT a.*, g.label AS geo_label, g.geo_url
FROM annonce a
INNER JOIN geolocalisation g ON a.geolocalisation_id = g.geolocalisation_id 
WHERE a.categorie_id = 1 AND g.gauche >= 151579 AND g.droite <= 151580 
AND couchage >= 2
ORDER BY FIELD(provenance_id, 2, 1), prix DESC, date_modification DESC, annonce_id ASC

I have an execution time of more than 10s.

+----+-------------+-------+------------+--------+---------------------------------+--------------+---------+------------------------------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type   | possible_keys                   | key          | key_len | ref                          | rows   | filtered | Extra                                              |
+----+-------------+-------+------------+--------+---------------------------------+--------------+---------+------------------------------+--------+----------+----------------------------------------------------+
|  1 | SIMPLE      | a     | NULL       | ref    | geolocalisation_id,categorie_id | categorie_id | 4       | const                        | 502897 |    33.33 | Using index condition; Using where; Using filesort |
|  1 | SIMPLE      | g     | NULL       | eq_ref | PRIMARY,droite,gauche           | PRIMARY      | 4       | vacamax.a.geolocalisation_id |      1 |    25.00 | Using where                                        |
+----+-------------+-------+------------+--------+---------------------------------+--------------+---------+------------------------------+--------+----------+----------------------------------------------------+

If i force the index on geolocalisation to "gauche"

EXPLAIN
SELECT a.*, g.label AS geo_label, g.geo_url
FROM annonce a
INNER JOIN geolocalisation g ON a.geolocalisation_id = g.geolocalisation_id 
WHERE a.categorie_id = 1 AND g.gauche >= 151579 AND g.droite <= 151580 
AND couchage >= 2
ORDER BY FIELD(provenance_id, 2, 1), prix DESC, date_modification DESC, annonce_id ASC

I have an execution time of .1s

+----+-------------+-------+------------+-------+---------------------------------+--------------------+---------+------------------------------+-------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys                   | key                | key_len | ref                          | rows  | filtered | Extra                                                               |
+----+-------------+-------+------------+-------+---------------------------------+--------------------+---------+------------------------------+-------+----------+---------------------------------------------------------------------+
|  1 | SIMPLE      | g     | NULL       | range | gauche                          | gauche             | 4       | NULL                         | 52785 |    33.33 | Using index condition; Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | a     | NULL       | ref   | geolocalisation_id,categorie_id | geolocalisation_id | 5       | vacamax.g.geolocalisation_id |    13 |    16.66 | Using where                                                         |
+----+-------------+-------+------------+-------+---------------------------------+--------------------+---------+------------------------------+-------+----------+---------------------------------------------------------------------+

The result is 188 rows. It seems like far too much rows are tested in the first case, but filtering is efficient in the second case : geolocalisation is a filter which should be applied BEFORE the join : 1) you get the places satisfying conditions 2) you find rentals having those places geolocalisation_id by matching the tables. Please enlighten me.


Solution

  • You know that filtering the geolocation is smarter to do before than after, because you know something about your data and your query that MySQL doesn't.

    Specifically, MySQL guesses that it has to look at 502897*1 rows in the first query, and 52785*13=686205 rows for the second query, and decides to use the first one. There are other factors that go into the decision which execution plan to use, but it gives you a rough idea of what MySQL thinks your data looks like. It's far away from reality (188 rows), and it's not too surprising that basing the decision upon such incorrect assumptions led to a bad strategy.

    In fact, even I only know that because you told me, and now can assume, based on column names, that gauche is always smaller than droite, so your condition on g probably describes a very narrow window. But MySQL does not know that, as you did not tell MySQL that, so it cannot take that into consideration. And it also of course doesn't have the ability to base decisions on the meaning of column names.

    Since you have an index on gauge, for a high value (e.g. g.gauge >= your_max_value_in_that_column), MySQL should actually be able to find out that there is only a handful of rows and should use a better execution plan. Otherwise, MySQL is basically clueless. Try varying the window size over a very wide range (e.g. g.gauche >= 100000 AND g.droite <= 200000); MySQL will not show a significantly different number in rows, unless you get close to the limits of your columns (and have an index on them). For some ranges, the first query actually should get faster, as it gets closer to the data distribution MySQL assumes.

    So how can you tell MySQL about your data distribution?

    It might be possible to encode your information as spatial data (a point) and an index on it. Then you can look for points that lie in a 2d rectangle, and MySQL can now understand that this is actually a very small rectangle containing a limited amount of data. It's not required that your data is actually geometric data, just that you can encode it in 2 dimensions.

    Assuming my assumption is correct, you may also be able to use (g.gauche = 151579 or g.gauche = 151580), and MySQL should also be able to understand that this is only a limited amount of data.

    And you can of course just force the index (or use FROM geolocalisation g STRAIGHT_JOIN annonce a). You know something MySQL doesn't, and oftentimes, you cannot tell MySQL otherwise. The disadvantage is that this cannot adept to other situations e.g. if you (occasionally) use larger windows in your query, or gauche <= droite isn't true anymore.