Search code examples
mysqlsqlinner-join

How can I get MySQL optimizer to use my index first when JOINING?


I've got a 50M row table, and I'm trying to write a query to export data. A relevant subset of the table structure:

CREATE TABLE `questions` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `asked_at` datetime NOT NULL,
  `site_id` int(11) NOT NULL,
  `account_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `k_site_id_deleted_asked_at` (`site_id`,`deleted`,`asked_at`)
)

and a subset of my SELECT (the real one has... a lot of joins)

SELECT 
    `questions`.`id` AS question_id,
    `sites`.`name` AS site_name, 
    `accounts`.`name` AS account_name 
    FROM `questions` FORCE INDEX (k_site_id_deleted_asked_at) 
    INNER JOIN `sites` ON `sites`.`id` = `questions`.`site_id`
    INNER JOIN `accounts` ON `questions`.`account_id` = `accounts`.`id` 
    WHERE `questions`.`asked_at` BETWEEN '2017-07-06 07:00:00' AND '2017-08-07 06:59:59' 
    AND `questions`.`site_id` IN (<some_numbers>) 
    AND `questions`.`deleted` = 0 
    ORDER BY `questions`.`asked_at`;

This query is not fast enough; for large numbers of site_ids and dates, it'll take upwards of a minute to run. However, when I take the join on sites out of the query:

SELECT 
    `questions`.`question_number`,
    `accounts`.`name` AS Account 
    FROM `questions` FORCE INDEX (k_site_id_deleted_asked_at) 
    INNER JOIN `accounts` ON `questions`.`account_id` = `accounts`.`id` 
    WHERE `questions`.`asked_at` BETWEEN '2017-07-06 07:00:00' AND '2017-08-07 06:59:59' 
    AND `questions`.`site_id` IN (<some_numbers>) 
    AND `questions`.`deleted` = 0 
    ORDER BY `questions`.`asked_at`;

it's blindingly fast — I see my first row in ~50ms — even when I have the rest of my crazy fields and joins in there. It's just joining on sites that seems to be causing a problem.

EXPLAIN gives me a hint as to what's going on. When I have my join for sites in there, sites is listed as the first row in the EXPLAIN output. When I take it out, questions is the first row.

I think the query optimizer is looking at the tables, saying "oh hey I can limit on the sites primary key waaaaay better than this questions thing." But it's wrong.

Is there a way I can tell the query engine that I want it to look at the questions first?


Solution

  • Try to filter it first before doing any joins.

    SELECT `questions`.`id` AS question_id,
            `sites`.`name` AS site_name, 
            `accounts`.`name` AS account_name 
    FROM `questions` 
    INNER JOIN
        (SELECT 
            `id`
            FROM `questions` FORCE INDEX (k_site_id_deleted_asked_at)       
            WHERE `questions`.`asked_at` BETWEEN '2017-07-06 07:00:00' AND '2017-08-07 06:59:59' 
            AND `questions`.`site_id` IN (<some_numbers>) 
            AND `questions`.`deleted` = 0 
            ORDER BY `questions`.`asked_at`) t1 on t1.`id` = `questions`.`id`
    INNER JOIN `sites` ON `sites`.`id` = `questions`.`site_id`
    INNER JOIN `accounts` ON `questions`.`account_id` = `accounts`.`id`