Search code examples
group-bymariadbquery-optimization

MariaDB group by is slow


I have a scenario whereby I am trying to fetch a user's most recent messages from over 2 million rows and group them by their parent (or thread) id. However, this grouping is causing the query time to be around 1s, about 1000 times slower than without the group by.

Here is the table

CREATE TABLE `msg` (
`msg_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`msg_to` int(10) unsigned NOT NULL,
`msg_from` int(10) unsigned NOT NULL,
`msg` varchar(500) COLLATE utf8mb4_unicode_ci NOT NULL,
`date` timestamp NOT NULL DEFAULT current_timestamp(),
`parent` int(10) unsigned NOT NULL,
 PRIMARY KEY (`msg_id`),
 KEY `msg_toIX` (`msg_to`) USING BTREE,
 KEY `msg_fromIX` (`msg_from`) USING BTREE,
 KEY `parentIX` (`parent`) USING BTREE ) 
ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Here is my query

SELECT a.msg_id, a.msg_from, a.msg FROM msg a 
JOIN(SELECT MAX(msg_id) maxid FROM msg WHERE msg_to = 23 GROUP BY parent ORDER BY msg_id DESC LIMIT 10) b
ON a.msg_id IN (b.maxid)
ORDER BY a.msg_id DESC LIMIT 10

Explained enter image description here

Is this as good as it gets or should I be getting better performance seeing as I can extract 10 thousand rows in 0.001 with a condition and without the group by clause? Am I going about this the wrong way?

Thanks for your continued guidance and support

UPDATE Thanks everyone who contributed. O.Jones hit the nail on the head down below. Compound Index was the missing link


Solution

  • With a bit of input from everyone I will answer the question fully.

    I initially included the LIMIT in the JOIN sub query as this helped to speed it up. O.Jones advice of compound index on 3 fields helped take the query time from 1s down to 0.04s but then I noticed that the results were omitting a huge range of rows. I re-included the ORDER BY in the sub query which corrected the results but took the query up to 0.2s.

    Switching the order of the index from (msg_to, parent, msg_id) - as per O.Jones' answer - to (parent,msg_to,msg_id) resulted in the new index to be used in the GROUP BY, reducing scanned rows in DERIVED table from 1 million+ down to 61 and query time to 0.000s, also allowing for the omission of ORDER BY and LIMIT in sub query.

    So the correct query in is

    SELECT a.msg_id, a.msg_from, a.msg FROM msg a 
    JOIN(SELECT MAX(msg_id) maxid FROM msg WHERE msg_to = 23 GROUP BY parent) b
    ON a.msg_id = b.maxid
    ORDER BY a.msg_id DESC 
    LIMIT 10
    

    With Index:

    CREATE INDEX parent_to_id ON msg (parent, msg_to, msg_id);
    

    Explained

    Thanks everybody. Truth be told, it was a pretty awesome question which deserved some up votes. I'm sure someone will find it useful in the future :)