I started to investigate why some searches in the Django admin where really slow (see here). Digging further I found that MySQL (5.1, InnoDB tables) performance vary a lot from one query to another one similar. For example:
This query (looking for 'c', 'd' and 'e' in 4 fields, 2 related) generated by Django take 89 ms and return 3093 rows:
SELECT DISTINCT `donnees_artiste`.`id`
FROM `donnees_artiste`
LEFT OUTER JOIN `donnees_artiste_evenements`
ON (`donnees_artiste`.`id` = `donnees_artiste_evenements`.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement`
ON (`donnees_artiste_evenements`.`evenement_id` = `donnees_evenement`.`id`)
LEFT OUTER JOIN `donnees_artiste_evenements` T4
ON (`donnees_artiste`.`id` = T4.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement` T5
ON (T4.`evenement_id` = T5.`id`)
LEFT OUTER JOIN `donnees_artiste_evenements` T6
ON (`donnees_artiste`.`id` = T6.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement` T7
ON (T6.`evenement_id` = T7.`id`)
WHERE (
(`donnees_artiste`.`nom` LIKE '%c%'
OR `donnees_artiste`.`prenom` LIKE '%c%'
OR `donnees_evenement`.`cote` LIKE '%c%'
OR `donnees_evenement`.`titre` LIKE '%c%' )
AND (`donnees_artiste`.`nom` LIKE '%d%'
OR `donnees_artiste`.`prenom` LIKE '%d%'
OR T5.`cote` LIKE '%d%'
OR T5.`titre` LIKE '%d%' )
AND (`donnees_artiste`.`nom` LIKE '%e%'
OR `donnees_artiste`.`prenom` LIKE '%e%'
OR T7.`cote` LIKE '%e%'
OR T7.`titre` LIKE '%e%' )
);
If I replace the 'e' by a 'k' so it's mostly the same query, it take 8720 ms (100x increase) and return 931 rows.
SELECT DISTINCT `donnees_artiste`.`id`
FROM `donnees_artiste`
LEFT OUTER JOIN `donnees_artiste_evenements`
ON (`donnees_artiste`.`id` = `donnees_artiste_evenements`.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement`
ON (`donnees_artiste_evenements`.`evenement_id` = `donnees_evenement`.`id`)
LEFT OUTER JOIN `donnees_artiste_evenements` T4
ON (`donnees_artiste`.`id` = T4.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement` T5
ON (T4.`evenement_id` = T5.`id`)
LEFT OUTER JOIN `donnees_artiste_evenements` T6
ON (`donnees_artiste`.`id` = T6.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement` T7
ON (T6.`evenement_id` = T7.`id`)
WHERE (
(`donnees_artiste`.`nom` LIKE '%c%'
OR `donnees_artiste`.`prenom` LIKE '%c%'
OR `donnees_evenement`.`cote` LIKE '%c%'
OR `donnees_evenement`.`titre` LIKE '%c%' )
AND (`donnees_artiste`.`nom` LIKE '%d%'
OR `donnees_artiste`.`prenom` LIKE '%d%'
OR T5.`cote` LIKE '%d%'
OR T5.`titre` LIKE '%d%' )
AND (`donnees_artiste`.`nom` LIKE '%k%'
OR `donnees_artiste`.`prenom` LIKE '%k%'
OR T7.`cote` LIKE '%k%'
OR T7.`titre` LIKE '%k%' )
);
Both of these query give the same EXPLAIN
, so no clue there.
ID SELECT_TYPE TABLE TYPE POSSIBLE_KEYS KEY KEY_LEN REF ROWS EXTRA
1 SIMPLE donnees_artiste ALL None None None None 4368 Using temporary; Using filesort
1 SIMPLE donnees_artiste_evenements ref artiste_id,donnees_artiste_evenements_eb99df11 artiste_id 4 mmac.donnees_artiste.id 1 Using index; Distinct
1 SIMPLE donnees_evenement eq_ref PRIMARY,donnees_evenements_id_index PRIMARY 4 mmac.donnees_artiste_evenements.evenement_id 1 Using where; Distinct
1 SIMPLE T4 ref artiste_id,donnees_artiste_evenements_eb99df11 artiste_id 4 mmac.donnees_artiste.id 1 Using index; Distinct
1 SIMPLE T5 eq_ref PRIMARY,donnees_evenements_id_index PRIMARY 4 mmac.T4.evenement_id 1 Using where; Distinct
1 SIMPLE T6 ref artiste_id,donnees_artiste_evenements_eb99df11 artiste_id 4 mmac.donnees_artiste.id 1 Using index; Distinct
1 SIMPLE T7 eq_ref PRIMARY,donnees_evenements_id_index PRIMARY 4 mmac.T6.evenement_id 1 Using where; Distinct
Also if I do a COUNT
on the first query it take 11200 ms.
SELECT COUNT(DISTINCT `donnees_artiste`.`id`)
FROM `donnees_artiste`
LEFT OUTER JOIN `donnees_artiste_evenements`
ON (`donnees_artiste`.`id` = `donnees_artiste_evenements`.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement`
ON (`donnees_artiste_evenements`.`evenement_id` = `donnees_evenement`.`id`)
LEFT OUTER JOIN `donnees_artiste_evenements` T4
ON (`donnees_artiste`.`id` = T4.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement` T5
ON (T4.`evenement_id` = T5.`id`)
LEFT OUTER JOIN `donnees_artiste_evenements` T6
ON (`donnees_artiste`.`id` = T6.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement` T7
ON (T6.`evenement_id` = T7.`id`)
WHERE (
(`donnees_artiste`.`nom` LIKE '%c%'
OR `donnees_artiste`.`prenom` LIKE '%c%'
OR `donnees_evenement`.`cote` LIKE '%c%'
OR `donnees_evenement`.`titre` LIKE '%c%' )
AND (`donnees_artiste`.`nom` LIKE '%d%'
OR `donnees_artiste`.`prenom` LIKE '%d%'
OR T5.`cote` LIKE '%d%'
OR T5.`titre` LIKE '%d%' )
AND (`donnees_artiste`.`nom` LIKE '%e%'
OR `donnees_artiste`.`prenom` LIKE '%e%'
OR T7.`cote` LIKE '%e%'
OR T7.`titre` LIKE '%e%' )
);
My innodb_buffer_pool_size
is set high. I have indexes on all relevant fields and on primary keys and I already optimized my tables.
So, why the first query is so fast and the 2 others so slow? These 3 queries are just examples. Many time I'm just changing or removing one character from a query and it made big difference on the query time. But I can't see any pattern.
UPDATE
The performance problem definitely come from how Django generate these queries. All these redundant LEFT OUTER JOIN
chained together kill the performance. At the moment it's not totally clear to me if it's a bug in the Django SQL generator, a bug in how the query is built for the search field or if all that work as expected by the Django developers. I'm still investigating but there's, at least, one strange thing in the Django behavior...
If I run this query (that is not necessarily equivalent to the second one, but not far) the results come pretty fast (161 ms, no cache):
SELECT DISTINCT `donnees_artiste`.`id`
FROM `donnees_artiste`
LEFT OUTER JOIN `donnees_artiste_evenements`
ON (`donnees_artiste`.`id` = `donnees_artiste_evenements`.`artiste_id`)
LEFT OUTER JOIN `donnees_evenement`
ON (`donnees_artiste_evenements`.`evenement_id` = `donnees_evenement`.`id`)
WHERE (
(`donnees_artiste`.`nom` LIKE '%c%'
OR `donnees_artiste`.`prenom` LIKE '%c%'
OR `donnees_evenement`.`cote` LIKE '%c%'
OR `donnees_evenement`.`titre` LIKE '%c%' )
AND (`donnees_artiste`.`nom` LIKE '%d%'
OR `donnees_artiste`.`prenom` LIKE '%d%'
OR `donnees_evenement`.`cote` LIKE '%d%'
OR `donnees_evenement`.`titre` LIKE '%d%' )
AND (`donnees_artiste`.`nom` LIKE '%k%'
OR `donnees_artiste`.`prenom` LIKE '%k%'
OR `donnees_evenement`.`cote` LIKE '%k%'
OR `donnees_evenement`.`titre` LIKE '%k%' )
);
SECOND UPDATE
Finally that's not a bug in Django, I'm pretty sure it's the desired behavior. The idea is, on a multi-terms search, the search of the next term is done on the subset return by the previous term so, for the related fields, all the terms don't have to be in the same row to have a match. For this, the DB have to create temporary table with each subsets and scan it. That explain why there can be a lot a variation because if the first term match only a few rows, the temporary table will be small, and search of subsequent term will be fast (because they will be done on a small table). The difference between the two queries is subtle but the Django query can return more matches in general.
I think, the answer is that e
in most cases is located at the beginning of the scanned strings and in the first searched string, allowing to short cirquit the OR conditions, while matches for k
happen in the last conditions and somewhere in the end of the string. And since there are significantly less rows with k
, more strings should be full scanned without any matches.