Tags: php, mysql, optimization, query-optimization, greatest-n-per-group

How Could this MySQL Query be Improved?


This query is trying to do something that MySQL doesn't do easily, which is to limit the number of rows per group. A list of user_ids is passed to the query, and a few pieces of data are returned, but the result needs to be limited to 4 rows per group. The query works, but is somewhat slow (200-500 ms according to Sequel Pro).

CONTINUE READING before flagging please!!

SELECT id, user_id, post_id, row_number
FROM (
    SELECT a2.id, a2.user_id, a2.post_id,
        @num:= if(@group = a2.user_id, @num + 1, 1) as row_number
    FROM (
        SELECT a1.id, a1.user_id, a1.post_id
        FROM articles as a1
        WHERE a1.user_id IN (3,14,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,38,39,13,114,1111,12,223,2234,225,226,227,228,229,2210)
        ORDER BY a1.date DESC
    ) as a2, 
    (SELECT @num := 0) t
) as f
WHERE row_number <= 4;

The EXPLAIN for this query is:

id  select_type  table       type    possible_keys  key   key_len  ref   rows     Extra
1   PRIMARY      <derived2>  ALL     NULL           NULL  NULL     NULL  10516    Using where
2   DERIVED      <derived4>  system  NULL           NULL  NULL     NULL  1
2   DERIVED      <derived3>  ALL     NULL           NULL  NULL     NULL  10516
4   DERIVED      NULL        NULL    NULL           NULL  NULL     NULL  NULL     No tables used
3   DERIVED      s1          ALL     Reco...        NULL  NULL     NULL  1180931  Using filesort

I thought about breaking this up into multiple queries, but I still run into the issue of limiting each group's result to 4. All in all, I'm trying to avoid both a large number of queries and expensive ones.

Any ideas on the best way to go about improving this query's speed by breaking it up and moving some of it into the application?


Solution

  • To answer your question, I don't see any effective way to "break up" this query. You would still need to figure out whether articles from that one user_id (@group) are contiguous by date, with no intervening posts from any of the other user_ids. And having all the rows sorted together, by date, is going to be the best way to do that.

    If the rows being eliminated are a large fraction of the total, then filtering them on the client side means transferring a much larger resultset. But if only a small fraction of the rows is filtered out, then sending all the rows (for all of the users in the list) to the client for processing becomes more appealing.

    SELECT a.id
         , a.user_id
         , a.post_id
      FROM articles a
     WHERE a.user_id IN (3,14,1,2,3,4,5,6,7,8,9,10,11,12,...)
     ORDER BY a.date DESC
    

    Then the client could fetch through the rows, checking for contiguous sequences of rows for that single user_id (@group), and just ignoring the fifth, sixth, etc. rows until a row with a different user_id is found.
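    A minimal sketch of that client-side pass, in plain Python (the rows list is made-up sample data standing in for the fetched resultset, and TARGET_USER plays the role of the pre-set @group value):

```python
# Rows as fetched by the query above, already in date-descending order.
# Each tuple is (id, user_id, post_id); TARGET_USER stands in for @group.
TARGET_USER = 3

rows = [
    (1, 3, 101), (2, 3, 102), (3, 3, 103), (4, 3, 104), (5, 3, 105),  # run of 5
    (6, 7, 201),                                                      # other user breaks the run
    (7, 3, 106), (8, 3, 107),                                         # new run
]

kept = []
num = 0
for row in rows:
    _id, user_id, _post_id = row
    if user_id == TARGET_USER:
        num += 1   # a contiguous run of the target user keeps counting
    else:
        num = 1    # any other user's row resets the counter (that row is #1)
    if num <= 4:   # ignore the fifth, sixth, ... rows of a contiguous run
        kept.append(row)
```

    Note that, mirroring the user-variable logic, an intervening row sets the counter to 1 rather than 0, so a run of the target user that immediately follows one contributes at most three more rows before hitting the limit.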

    If the specifications for the resultset were different, it might be possible to break it up. But the way the query is written now, the resultsets from any "broken up" queries would need to be combined in order to reproduce the resultset returned by the current query.
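    (As an aside: if the specification called for the more conventional "latest four rows per user_id", regardless of intervening posts, a window function would express that directly on MySQL 8.0+. A runnable sketch of that pattern, using SQLite with made-up sample data, since ROW_NUMBER() behaves the same way there:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, user_id INT, post_id INT, date TEXT);
    INSERT INTO articles (user_id, post_id, date) VALUES
        (3, 101, '2013-07-01'), (3, 102, '2013-07-02'), (3, 103, '2013-07-03'),
        (3, 104, '2013-07-04'), (3, 105, '2013-07-05'),
        (7, 201, '2013-07-06');
""")

# Latest 4 articles per user_id -- true greatest-n-per-group, unlike the
# contiguous-run behaviour of the user-variable query in the question.
rows = conn.execute("""
    SELECT id, user_id, post_id
    FROM (
        SELECT id, user_id, post_id,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date DESC) AS rn
        FROM articles
        WHERE user_id IN (3, 7)
    ) AS t
    WHERE rn <= 4
""").fetchall()
```

    Here user 3 has five articles, so the oldest one is dropped, while user 7's single article survives. That is a different resultset from the question's query, which would keep or drop rows based on contiguity.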


    (This query is SIGNIFICANTLY different from the query in the question marked by Marc B as a possible duplicate.)

    This is an odd resultset; @group is never assigned a value anywhere in the statement, so presumably it is set prior to the execution of this statement. So, the expression

    @group = a2.user_id
    

    tests whether the user_id is equal to a constant. That means the query is identifying rows from articles posted by a single user_id, and incrementing row_number whenever that user posts two (or more) articles in succession, with no intervening article posted by any other user_id in the IN list (as ordered by the date column). An article posted by another user_id in the IN list resets the counter to 1.

    The net effect is that this query returns all articles from ALL users specified in the IN list EXCEPT for a single user_id (which may or may not be in the list). Whenever five or more articles are posted contiguously by that one constant user_id, with no intervening article from another user_id in the IN list, the query keeps only the first four (latest four) rows of that contiguous run.

    If the date column is the DATE datatype, with no time component, it is much more likely that you will have multiple rows with the same date. And since no ordering is specified beyond the date column, the resultset is indeterminate. (That is, multiple orderings of the same set of rows can satisfy the ORDER BY.) It is also indeterminate with a DATETIME, but if most of those values have unique-ish time components (i.e. something other than a constant such as midnight), it is less likely to be an issue.

    What's odd about it, is that the same set of rows could be ordered two ways, and give different results. Assuming @group identifies user 'abc':

    Date       user   id        Date       user   id
    ---------- ------ --        ---------- ------ --
    2013-07-22 abc     1        2013-07-22 abc     1
    2013-07-22 abc     2        2013-07-22 abc     2
    2013-07-22 abc     3        2013-07-22 abc     3
    2013-07-22 EFGHI   4        2013-07-22 abc     5
    2013-07-22 abc     5        2013-07-22 abc     6
    2013-07-22 abc     6        2013-07-22 abc     7
    2013-07-22 abc     7        2013-07-22 EFGHI   4
    
    7 rows selected.            5 rows selected.
    

    Both resultsets are consistent with the specification, so either could be returned.

    There's nothing wrong with returning a resultset like that. It's just a bit odd.


    In terms of performance, an index with a leading column of (user_id) might be suitable for the predicate in the WHERE clause, if that is eliminating a large percentage of rows.

    Or, an index with leading columns of (date,user_id) might be more suitable, in that MySQL could avoid a "Using filesort" operation, and retrieve the rows in descending date order, and then filter out the rows with the predicate on user_id as the rows are accessed.

    Actually, a covering index on columns (date, user_id, post_id, id) might be even more beneficial.
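    One way to sanity-check the covering-index idea is with SQLite's EXPLAIN QUERY PLAN standing in for MySQL's EXPLAIN (in SQLite the id column is the rowid, which every secondary index carries implicitly, so an index on (date, user_id, post_id) covers the whole select list):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, user_id INT, post_id INT, date TEXT);
    CREATE INDEX ix_articles_cover ON articles (date, user_id, post_id);
""")

# The index supplies both the ORDER BY date ordering and every selected
# column, so the planner can walk the index alone: no table lookups, no sort.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT id, user_id, post_id
    FROM articles
    WHERE user_id IN (3, 7)
    ORDER BY date DESC
""").fetchall()
plan_text = " ".join(row[-1] for row in plan)
```

    On MySQL the rough equivalent would be ALTER TABLE articles ADD INDEX (date, user_id, post_id, id), after which EXPLAIN should show "Using index" and the "Using filesort" should disappear.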