I've got a 50M row table, and I'm trying to write a query to export data. A relevant subset of the table structure:
CREATE TABLE `questions` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`asked_at` datetime NOT NULL,
`site_id` int(11) NOT NULL,
`account_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `k_site_id_deleted_asked_at` (`site_id`,`deleted`,`asked_at`)
)
and a subset of my SELECT (the real one has... a lot of joins)
SELECT
`questions`.`id` AS question_id,
`sites`.`name` AS site_name,
`accounts`.`name` AS account_name
FROM `questions` FORCE INDEX (k_site_id_deleted_asked_at)
INNER JOIN `sites` ON `sites`.`id` = `questions`.`site_id`
INNER JOIN `accounts` ON `questions`.`account_id` = `accounts`.`id`
WHERE `questions`.`asked_at` BETWEEN '2017-07-06 07:00:00' AND '2017-08-07 06:59:59'
AND `questions`.`site_id` IN (<some_numbers>)
AND `questions`.`deleted` = 0
ORDER BY `questions`.`asked_at`;
This query is not fast enough; for large numbers of site_ids and dates, it'll take upwards of a minute to run. However, when I take the join on sites
out of the query:
SELECT
`questions`.`question_number`,
`accounts`.`name` AS Account
FROM `questions` FORCE INDEX (k_site_id_deleted_asked_at)
INNER JOIN `accounts` ON `questions`.`account_id` = `accounts`.`id`
WHERE `questions`.`asked_at` BETWEEN '2017-07-06 07:00:00' AND '2017-08-07 06:59:59'
AND `questions`.`site_id` IN (<some_numbers>)
AND `questions`.`deleted` = 0
ORDER BY `questions`.`asked_at`;
it's blindingly fast — I see my first row in ~50ms — even when I have the rest of my crazy fields and joins in there. It's just joining on sites
that seems to be causing a problem.
EXPLAIN
gives me a hint as to what's going on. When I have my join for sites
in there, sites
is listed as the first row in the EXPLAIN
output. When I take it out, questions
is the first row.
I think the query optimizer is looking at the tables, saying "oh hey I can limit on the sites
primary key waaaaay better than this questions
thing." But it's wrong.
Is there a way I can tell the query engine that I want it to look at the questions
first?
Try to filter it first before doing any joins.
SELECT `questions`.`id` AS question_id,
`sites`.`name` AS site_name,
`accounts`.`name` AS account_name
FROM `questions`
INNER JOIN
(SELECT
`id`
FROM `questions` FORCE INDEX (k_site_id_deleted_asked_at)
WHERE `questions`.`asked_at` BETWEEN '2017-07-06 07:00:00' AND '2017-08-07 06:59:59'
AND `questions`.`site_id` IN (<some_numbers>)
AND `questions`.`deleted` = 0
ORDER BY `questions`.`asked_at`) t1 on t1.`id` = `questions`.`id`
INNER JOIN `sites` ON `sites`.`id` = `questions`.`site_id`
INNER JOIN `accounts` ON `questions`.`account_id` = `accounts`.`id`