Search code examples
mysqlquery-optimizationself-join

How to improve MySQL performance on self join


I'm doing a self-join on a table like this one (the actual table has many more fields):

id pagetitle parent class_key
1 Page1 0 web
2 Page2 1 web
3 Page3 1 xxx
4 Page4 3 web
5 Page5 4 web
6 Page6 0 alt

I need to find records where the class_key does not match the class_key of the parent record (the parent field is the id of another record). Records with a 0 parent have no parent. The parent and ID fields can't be null. And no record has an ID of 0. So I need a result that contains only records 3 and 4, since they have parents and their class_keys don't match their parent's class_key.

This query appears to work:

SELECT 
    c1.id, 
    c1.pagetitle, 
    c1.parent, 
    c1.context_key, 
    c2.context_key as parent_context_key
FROM `site_content` c1 
INNER JOIN site_content c2
ON 
    c1.parent > 0 
    AND c1.parent = c2.id 
    AND c1.context_key <> c2.context_key 
ORDER BY 
    c1.pagetitle

It also appears to work if I remove the first ON condition. When leaving that out, it also works if I reverse the remaining conditions.

It was a surprise to me that the query times in PhpMyAdmin are very close for the three queries (~015 s). I would have thought the first would be the fastest since it eliminates the records with no parent first (it's not). My actual table only has about 20 rows, but the code may run on very large tables.

I thought I'd ask people who know a lot more about this than I do for suggestions on the best form of this query, or a better query that would do the job.


Solution

  • That id column is almost certainly a primary key. So the query planner can easily infer a condition that's redundant with c2.parent >= 0 from the second condition c1.parent = c2.id: it knows any rows that match the second condition also match the first, so the query plans are probably close to identical.

    Beware: query plans can be much different in large tables. The access-cost profile (i/o and cpu) is quite different in larger tables, so the query planner weights various alternative plans differently. (tl;dr it's pointless to try to infer anything about performance from tiny tables.)

    Leave the first condition in there though. It makes your query easier to reason about, and it may come in handy for performance optimization sometime down the line if a new version has query-planner changes. SQL statements and their data can last decades, far longer than the versions of server they run on.