Search code examples
mysqlperformancejoindatabase-performance

sql join order dramatically changes performance


so I have two tables in mysql: articles and articles_rubrics, both with ~20.000 rows

article has multiple cols, but its article_id is indexed.

articles_rubrics has only two cols: article_id and rubrics_id and both are indexed separately and on top of that there is joined index of these two.

My issue is that when I select data from these tables with join, the order is extremely important, which is an issue for me and I don't understand the reason for it:

SELECT article_id,rubric_id FROM articles
LEFT JOIN articles_rubrics USING(article_id)
WHERE rubric_id=1
ORDER BY article_id DESC
LIMIT 10;

and explain says (for articles_rubrics) this:

time: 0.312 s
key_len: 1
ref: const
rows: 7352
extra: Using where; Using temporary; Using filesort

But when I switch the order of it:

SELECT article_id,rubric_id FROM articles_rubrics
LEFT JOIN articles USING(article_id)
WHERE rubric_id=1
ORDER BY article_id DESC
LIMIT 10;

and explain says (for articles_rubrics) this:

time: 0.001 s
key_len:9
ref: NULL
rows: 28
extra: Using where; Using index

So I have two tables and this makes their querying go ~300 times slower/faster. How is that even possible?

PS: I've heavily simplified my real world problem for this example, but I stumbled upon this because my

SELECT * FROM articles [LEFT JOIN for 5 other tables]

was taking 1.5s and when I actually added other join to the mix, execution time changed to like 0.006s.

Show index:

show index from articles;
Table   Non_unique  Key_name    Seq_in_index    Column_name Collation   Cardinality Sub_part    Packed  Null    Index_type  Comment Index_comment
articles    0   PRIMARY 1   article_id  A   20043   NULL    NULL        BTREE        
articles    1   article_url_title   1   article_url_title   A   10021   NULL    NULL        BTREE        
articles    1   FULLTEXT    1   article_title   NULL    1   NULL    NULL        FULLTEXT         
articles    1   FULLTEXT    2   article_content NULL    1   NULL    NULL        FULLTEXT         

show index from articles_rubrics;
Table   Non_unique  Key_name    Seq_in_index    Column_name Collation   Cardinality Sub_part    Packed  Null    Index_type
articles_rubrics    0   PRIMARY 1   article_id  A   NULL    NULL    NULL        BTREE
articles_rubrics    0   PRIMARY 2   rubric_id   A   20814   NULL    NULL        BTREE
articles_rubrics    1   rubric_id   1   rubric_id   A   17  NULL    NULL        BTREE
articles_rubrics    1   article_id  1   article_id  A   20814   NULL    NULL        BTREE

Solution

  • These two queries run the same, only difference is in using article_id from articles_rubrics in both.

    -- SELECT article_id,rubric_id FROM articles -- would be slow here
    SELECT ar.article_id,ar.rubric_id FROM articles
    JOIN articles_rubrics ar USING(article_id)
    WHERE rubric_id=1
    ORDER BY article_id DESC
    LIMIT 10;
    
    SELECT ar.article_id,ar.rubric_id FROM articles_rubrics ar
    JOIN articles USING(article_id)
    WHERE rubric_id=1
    ORDER BY article_id DESC
    LIMIT 10;
    

    If I force the sql server to use articles_rubrics table in the result, he correctly decides that the articles aren't actually needed. The server, however, won't do that automatically, even though article_id is used as a key.

    I still don't fully understand why it's happening (or how the optimization algorithm actually works), because in both cases, where rubric_id=1 goes into the articles_rubrics table and in both cases, the selected columns are there already (and join articles for existence is ran again, in both cases).

    However, for some reason, in the first example, the server decides to load all the articles first and only then, he checks each one for the rubric_id.