I have a scenario whereby I am trying to fetch a user's most recent messages from over 2 million rows and group them by their parent (or thread) id. However, this grouping is causing the query time to be around 1s, about 1000 times slower than without the group by.
Here is the table
CREATE TABLE `msg` (
`msg_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`msg_to` int(10) unsigned NOT NULL,
`msg_from` int(10) unsigned NOT NULL,
`msg` varchar(500) COLLATE utf8mb4_unicode_ci NOT NULL,
`date` timestamp NOT NULL DEFAULT current_timestamp(),
`parent` int(10) unsigned NOT NULL,
PRIMARY KEY (`msg_id`),
KEY `msg_toIX` (`msg_to`) USING BTREE,
KEY `msg_fromIX` (`msg_from`) USING BTREE,
KEY `parentIX` (`parent`) USING BTREE )
ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Here is my query
SELECT a.msg_id, a.msg_from, a.msg FROM msg a
JOIN(SELECT MAX(msg_id) maxid FROM msg WHERE msg_to = 23 GROUP BY parent ORDER BY msg_id DESC LIMIT 10) b
ON a.msg_id IN (b.maxid)
ORDER BY a.msg_id DESC LIMIT 10
Is this as good as it gets or should I be getting better performance seeing as I can extract 10 thousand rows in 0.001 with a condition and without the group by clause? Am I going about this the wrong way?
Thanks for your continued guidance and support
UPDATE Thanks everyone who contributed. O.Jones hit the nail on the head down below. Compound Index was the missing link
With a bit of input from everyone I will answer the question fully.
I initially included the LIMIT in the JOIN sub query as this helped to speed it up. O.Jones advice of compound index on 3 fields helped take the query time from 1s down to 0.04s but then I noticed that the results were omitting a huge range of rows. I re-included the ORDER BY in the sub query which corrected the results but took the query up to 0.2s.
Switching the order of the index from (msg_to, parent, msg_id) - as per O.Jones' answer - to (parent,msg_to,msg_id) resulted in the new index to be used in the GROUP BY, reducing scanned rows in DERIVED table from 1 million+ down to 61 and query time to 0.000s, also allowing for the omission of ORDER BY and LIMIT in sub query.
So the correct query in is
SELECT a.msg_id, a.msg_from, a.msg FROM msg a
JOIN(SELECT MAX(msg_id) maxid FROM msg WHERE msg_to = 23 GROUP BY parent) b
ON a.msg_id = b.maxid
ORDER BY a.msg_id DESC
LIMIT 10
With Index:
CREATE INDEX parent_to_id ON msg (parent, msg_to, msg_id);
Thanks everybody. Truth be told, it was a pretty awesome question which deserved some up votes. I'm sure someone will find it useful in the future :)