Search code examples
mysqldatabasesql-order-byquery-optimizationfilesort

Derived table with "order by" uses temporary table and filesort, even though I'm only selecting primary key


There's a forum with the tables: posts, topics, forums, users.

I'm trying to list the last 30 posts with related data from other tables, and the number of posts in the topic where the post is located in.

This is the query I use:

SELECT t.id, t.name, t.permissions, t.author, t.added, COUNT(p2.id) pcount, u2.username pusername, u2.id pauthor, p.added padded, p.id pid, u.username
FROM posts p
INNER JOIN (SELECT id FROM posts ORDER BY id DESC LIMIT 30) tmp ON tmp.id = p.id
INNER JOIN topics t ON t.id = p.topic
INNER JOIN users u ON t.author = u.id
INNER JOIN users u2 ON p.author = u2.id
INNER JOIN posts p2 ON p2.topic = t.id
GROUP BY id, name, permissions, author, added, pusername, pauthor, padded, pid, username

Explain SQL: https://i.sstatic.net/kCb0J.png

If I take out the GROUP BY statement, the filesort and the temporary table disappears, even though it shouldn't change that (I guess).

SELECT t.id, t.name, t.permissions, t.author, t.added, u2.username pusername, u2.id pauthor, p.added padded, p.id pid, u.username
FROM posts p
INNER JOIN (SELECT id FROM posts ORDER BY id DESC LIMIT 30) tmp ON tmp.id = p.id
INNER JOIN topics t ON t.id = p.topic
INNER JOIN users u ON t.author = u.id
INNER JOIN users u2 ON p.author = u2.id
INNER JOIN posts p2 ON p2.topic = t.id

Explain SQL: https://i.sstatic.net/OpGPy.png

Also I have an other query which achieves the same thing, but I have to use LEFT JOINs to avoid the filesort and the temporary table.

SELECT t.id, t.name, t.permissions, t.author, t.added, (SELECT COUNT(*) FROM posts WHERE topic = t.id) as pcount, u2.username as pusername, u2.id as pauthor, p.added as padded, p.id as pid, u.username
FROM posts p
LEFT JOIN topics t ON t.id = p.topic
LEFT JOIN users u ON t.author = u.id
LEFT JOIN users u2 ON p.author = u2.id
ORDER BY p.id DESC LIMIT 30

Explain SQL: https://i.sstatic.net/KXMBK.png

My questions are:

  • Which query is the superior regarding performance (both achieves the same thing)
  • If the first one is better, how can I get rid of the filesort and the temporary table (should I even? or is that okay, and just a side effect of the optimizer?)

Thanks guys!


Solution

  • Your third query is fine, and much simpler than the previous two. However, I'm not sure why you need to use LEFT JOIN, nor why not using INNER JOIN would cause a filesort.

    SELECT t.id, t.name, t.permissions, t.author, t.added, (SELECT COUNT(*) FROM posts WHERE topic = t.id) as pcount, u2.username as pusername, u2.id as pauthor, p.added as padded, p.id as pid, u.username
    FROM posts p
    INNER JOIN topics t ON t.id = p.topic
    INNER JOIN users u ON t.author = u.id
    INNER JOIN users u2 ON p.author = u2.id
    ORDER BY p.id DESC LIMIT 30
    

    The above is a straightforward, simple query for your request.

    If you can provide an sqlfiddle example of the filesort caused by using INNER JOIN instead of LEFT JOIN, then we can investigate that.

    Update after SQLFiddle Provided

    Using your sqlfiddle, I was able to discover some interesting behavior and information. Under various conditions, the filesort would appear, and others would cause it to disappear.

    One of those issues is the sparseness of the users table in sqlfiddle; therefore, I added more entries there, as previously using INNER JOIN would cause no results to be returned.

    Regardless, there are 3 potential fixes, and you will have to apply these to your real data set to determine how many of them you need to apply.

    Option 1

    Change all tables from MyISAM to InnoDB.

    Option 2

    If changing the table type is not possible, or insufficient, add an index to the posts table.

    ALTER TABLE `posts`
    ADD INDEX `id_topic_author_added_i` (`id`,`topic`,`author`,`added`);
    

    Option 3

    If the above two options are unavailable or insufficient, add an index to the users table.

    ALTER TABLE `users`
    ADD INDEX `id_username_i` (`id`,`username`);
    

    Reasoning

    The goal of the indexes and engine changes it to allow the query to make a single trip to the table. Under InnoDB, the clustered primary key should provide exactly the indexes needed for that to occur, based on your query. I am not as familiar with MyISAM, but that was not working in the sqlfiddle, at least.

    I can expand on the "why" these indexes help, if you desire.

    You can also look at my sqlfiddle with all 3 options applied, and see for yourself what happens as you remove each of the options above.

    Update: Why adding these indexes works

    First, lets start with some things from the documentation that we are told will or will not allow an index to be used (if not using an index, you likely will get the filesort instead):

    The following queries use the index to resolve the ORDER BY part:

    SELECT * FROM t1 ORDER BY key_part1,key_part2,... ;

    So that means that we should have the ORDER BY column be the first part of a key (aka index.)

    That's about all it says that applies to this query in terms of what will allow indexes to be used. Now, what will prevent indexes from being useful:

    You are joining many tables, and the columns in the ORDER BY are not all from the first nonconstant table that is used to retrieve rows. (This is the first table in the EXPLAIN output that does not have a const join type.)

    We are joining tables, so we definitely need to consider that one, and how to make sure the posts table is first.

    The key used to fetch the rows is not the same as the one used in the ORDER BY

    Okay, so we need to make sure we are using the same key. How do we do that?

    Well, generally the best response is to create what is called a covering index. That means a single index, which contains all the columns you wish to have in the SELECT statement.

    If you do NOT have a covering index, then what can happen is that the query ends up using an index to find a record, it then uses the primary key that is appended to all indexes to look up the main row (which contains all the columns), and then it has all the column values it needs. However, in doing so, it has performed 2 lookups per row, and that's what the covering index seeks to avoid.

    So, with the Option 2 index that worked above, you can see it is a covering index, so it is possible to get a single lookup to the posts table. Also, because id is first, we satisfy the first condition above. The covering index part, and putting the columns used to join with other tables first (topic and author) we allow the query to make those joins after going to the posts table (at least I think that is what is happening, I'm kinda hand waving on this sentence.) As such, we ensure it is the first in the EXPLAIN, and so avoid the second condition above which would prevent use of the index.

    So that's why the index works.

    Now, the weird thing is, if you are using InnoDB, then the rows are organized around the primary key of each of the tables, in what is called a clustered index. The clustered index is effectively a covering index of all non-TEXT or BLOB columns.

    So, changing the engine type to InnoDB should have been sufficient. As for why it was not, that exceeds my knowledge, and so you'll have to open a new question for that, if you are still curious.