Search code examples
mysqljoinsubqueryquery-performance

Why is subquery join much faster than direct join


I have 2 tables (pages and comments), around 130 000 rows each.

I want to list pages without any comments (foreign key is comments.page_id)

If I execute the normal left outer join, it takes an amazing more than 750 seconds to run. (130k^2 = 17B). Whereas if I execute the same join, but using subqueries for the tables, it takes just 1 second.

Server version: 5.6.44-log - MySQL Community Server (GPL):

Query 1. Normal join, 750+ seconds

SELECT p.id
FROM `pages` AS p
LEFT JOIN  `comments` AS c
    ON p.id = c.page_id
WHERE c.page_id IS NULL
GROUP BY 1

Query 2. Join with first table as subquery, Too much time

SELECT p.id
FROM (
    SELECT id FROM `pages`
) AS p
LEFT JOIN  `comments` AS c
    ON p.id = c.page_id
WHERE c.page_id IS NULL
GROUP BY 1

Query 3. Join with second table as subquery, 1.6 seconds

SELECT p.id
FROM `pages` AS p
LEFT JOIN (
   SELECT * FROM `comments`
) AS c
    ON p.id = c.page_id
WHERE c.page_id IS NULL
GROUP BY 1

Query 4. Join with 2 subqueries, 1 second

SELECT p.id
FROM (
    SELECT id FROM `pages`
) AS p
LEFT JOIN (
   SELECT * FROM `comments`
) AS c
    ON p.id = c.page_id
WHERE c.page_id IS NULL
GROUP BY 1

Query 5. Join with 2 subqueries, selecting only 1 column, 0.2 seconds

SELECT p.id
FROM (
    SELECT id FROM `pages`
) AS p
LEFT JOIN (
   SELECT page_id FROM `comments`
) AS c
    ON p.id = c.page_id
WHERE c.page_id IS NULL
GROUP BY 1

Query 6. Too much time

SELECT p.id
    FROM `pages` AS p
    WHERE NOT EXISTS( SELECT page_id FROM `comments`
                        WHERE page_id = p.id );;

Now, in MySql version 5.7, all of the above queries take "too much time" to execute.

In MySql 5.7, query 1 and 4 have same explanation:

id  select_type  table    partitions     type    possible_keys  key         key_len  ref    rows        filtered    Extra  
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1    SIMPLE         p       NULL        index       PRIMARY    PRIMARY      4       NULL    147626      100.00      Using index; Using temporary; Using filesort  
1    SIMPLE         c       NULL        ALL         NULL        NULL        NULL    NULL    147790      10.00       Using where; Not exists; Using join buffer (Block Nested Loop)

In MySql 5.6, unfortunately I cannot get the explanation for query 1 right now (taking too much time), but for query 4 is the below:

id  select_type table       type    possible_keys   key     key_len     ref     rows        Extra   
---------------------------------------------------------------------------------------------------------------------------
1   PRIMARY     <derived2>  ALL     NULL            NULL        NULL    NULL    147626      Using temporary; Using filesort 
1   PRIMARY     <derived3>  ref     <auto_key0>     <auto_key0>  4      p.id    10          Using where; Not exists    
3   DERIVED     comments    ALL     NULL            NULL        NULL    NULL    147790      NULL   
2   DERIVED     pages       index   NULL            PRIMARY     4       NULL    147626      Using index

Tables:

CREATE TABLE `pages` (
 `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
 `identifier` varchar(250) NOT NULL DEFAULT '',
 `reference` varchar(250) NOT NULL DEFAULT '',
 `url` varchar(1000) NOT NULL DEFAULT '',
 `moderate` varchar(250) NOT NULL DEFAULT 'default',
 `is_form_enabled` tinyint(1) unsigned NOT NULL DEFAULT '1',
 `date_modified` datetime NOT NULL,
 `date_added` datetime NOT NULL,
 PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=147627 DEFAULT CHARSET=utf8


CREATE TABLE `comments` (
 `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
 `user_id` int(10) unsigned NOT NULL DEFAULT '0',
 `page_id` int(10) unsigned NOT NULL DEFAULT '0',
 `website` varchar(250) NOT NULL DEFAULT '',
 `town` varchar(250) NOT NULL DEFAULT '',
 `state_id` int(10) NOT NULL DEFAULT '0',
 `country_id` int(10) NOT NULL DEFAULT '0',
 `rating` tinyint(1) unsigned NOT NULL DEFAULT '0',
 `reply_to` int(10) unsigned NOT NULL DEFAULT '0',
 `comment` text NOT NULL,
 `reply` text NOT NULL,
 `ip_address` varchar(250) NOT NULL DEFAULT '',
 `is_approved` tinyint(1) unsigned NOT NULL DEFAULT '1',
 `notes` text NOT NULL,
 `is_admin` tinyint(1) unsigned NOT NULL DEFAULT '0',
 `is_sent` tinyint(1) unsigned NOT NULL DEFAULT '0',
 `sent_to` int(10) unsigned NOT NULL DEFAULT '0',
 `likes` int(10) unsigned NOT NULL DEFAULT '0',
 `dislikes` int(10) unsigned NOT NULL DEFAULT '0',
 `reports` int(10) unsigned NOT NULL DEFAULT '0',
 `is_sticky` tinyint(1) unsigned NOT NULL DEFAULT '0',
 `is_locked` tinyint(1) unsigned NOT NULL DEFAULT '0',
 `is_verified` tinyint(1) unsigned NOT NULL DEFAULT '0',
 `date_modified` datetime NOT NULL,
 `date_added` datetime NOT NULL,
 PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=147879 DEFAULT CHARSET=utf8

Questions

  1. Why is this happening? What does MySql do under the hood?

  2. Does this happen only in MySql, or any other Sql as well?

  3. How can I write a fast query to get what I need? (In both v 5.6, 5.7)


Solution

  • The problem with your long-running queries, is that you lack an index on the page_id column of the comments table. Hence, for each row from the pages table, you need to check all rows of the comments table. Since you are using LEFT JOIN, this is the only possible join order. What happens in 5.6, is that when you use a subquery in the FROM clause (aka derived table), MySQL will create an index on the temporary table used for the result of the derived table (auto_key0 in the EXPLAIN output). The reason it is faster when you only select one column, is that the temporary table will be smaller.

    In MySQL 5.7, such derived tables will be automatically merge into the main query, if possible. This is done to avoid the extra temporary tables. However, this means that you no longer have an index to use for the join. (See this blog post for details.)

    You have two options to improve the query time in 5.7:

    1. You can create an index on comments(page_id)
    2. You can prevent the subquery from being merged by rewriting it to a query that can not be merged. Subqueries with aggregation, LIMIT, or UNION will not be merged (see the blog post for details). One way to do this is to add a LIMIT clause to the subquery. In order not to remove any rows from the result, the limit must be larger than the number of rows in the table.

    In MySQL 8.0, you can also use an optimizer hint to avoid the merging. In your case, that would be something like

    SELECT /*+ NO_MERGE(c) */ ... FROM
    

    See slides 34-37 of this presentation for examples of how to use such hints.