Tags: mysql, sql, query-optimization

Group By query running too slow to be functional


I have a table with ~16M records. I am trying to query the table to determine - for each unique person - what percentage of their records have a flag thrown on them. There are about 200 unique people. My query is this:

select person, concat((sum(qa_flag)/count(*))*100, '%') 
from myTable 
group by person;

Here qa_flag is a tinyint column containing either 1 or NULL. There is an index on the person column, but not on qa_flag.
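
For reference, the relevant part of the table looks something like this (simplified; the real table has more columns, and the exact person type here is illustrative):

create table myTable (
  -- illustrative schema; actual column types may differ
  person varchar(32),
  qa_flag tinyint,
  key (person)
);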

This query does not return in any reasonable time; I have let it run for up to an hour. Even worse, I need the query to run dynamically to populate a table on a webpage with these statistics.

Can someone show me what I am doing wrong, or, if I am not doing anything wrong per se, suggest ways of speeding this query up dramatically?

Thanks.


Solution

  • This query needs to do either a table-scan or an index-scan, so it has to visit all ~16M entries either way.

    I tested EXPLAIN for your query after creating a table with 16M rows.

    mysql> explain select person, concat((sum(qa_flag)/count(*))*100, '%') 
        -> from myTable 
        -> group by person;
    +----+-------------+---------+------------+-------+---------------+--------+---------+------+----------+----------+-------+
    | id | select_type | table   | partitions | type  | possible_keys | key    | key_len | ref  | rows     | filtered | Extra |
    +----+-------------+---------+------------+-------+---------------+--------+---------+------+----------+----------+-------+
    |  1 | SIMPLE      | myTable | NULL       | index | person        | person | 129     | NULL | 16329623 |   100.00 | NULL  |
    +----+-------------+---------+------------+-------+---------------+--------+---------+------+----------+----------+-------+
    

    This shows an index-scan (type: index) with an estimated "rows" count of ~16M (for an index-scan, this isn't really a count of table rows, but of index leaf entries).

    A possible optimization is to create a new index on both person and qa_flag, to serve as a covering index. That way MySQL can produce the result by reading only the index, without touching the table rows.

    mysql> alter table myTable add key (person, qa_flag);
    Query OK, 0 rows affected (22.11 sec)
    
    mysql> explain select person, concat((sum(qa_flag)/count(*))*100, '%') 
        -> from myTable 
        -> group by person;
    +----+-------------+---------+------------+-------+-----------------+----------+---------+------+----------+----------+-------------+
    | id | select_type | table   | partitions | type  | possible_keys   | key      | key_len | ref  | rows     | filtered | Extra       |
    +----+-------------+---------+------------+-------+-----------------+----------+---------+------+----------+----------+-------------+
    |  1 | SIMPLE      | myTable | NULL       | index | person,person_2 | person_2 | 131     | NULL | 16329623 |   100.00 | Using index |
    +----+-------------+---------+------------+-------+-----------------+----------+---------+------+----------+----------+-------------+
    

    This still does an index-scan of ~16M index entries, but it's a slight improvement because of the "Using index" note in the Extra column: the query can now be answered from the index alone.

    I tried executing the query. It finished pretty quickly considering how many index entries it had to examine:

    mysql> select person, concat((sum(qa_flag)/count(*))*100, '%') from myTable group by person;
    +----------------------------------+------------------------------------------+
    | person                           | concat((sum(qa_flag)/count(*))*100, '%') |
    +----------------------------------+------------------------------------------+
    | 0000023f507999464aa2b78875b7e5d6 | 0.0000%                                  |
    ...
    | fffffe98d0963d27015c198262d97221 | 0.0000%                                  |
    +----------------------------------+------------------------------------------+
    16777216 rows in set (8.64 sec)
    

    (I generated the person values as a bunch of random hashes.)
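
    To reproduce a table like this, a sketch along these lines works (the repeated doubling insert is just a convenient way to reach 2^24 = 16,777,216 rows; I use 0/1 for qa_flag in this sketch so the percentage is never NULL, which it would be for a group containing only NULL flags):

    -- illustrative test schema; 24 doublings of one seed row gives 2^24 rows
    create table myTable (
      person char(32),
      qa_flag tinyint,
      key (person)
    );

    -- seed a single row
    insert into myTable (person, qa_flag) values (md5(rand()), 0);

    -- run this insert 24 times; each pass doubles the row count
    insert into myTable (person, qa_flag)
      select md5(rand()), if(rand() < 0.01, 1, 0) from myTable;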

    I'm testing on a MacBook Pro laptop with an M1 CPU, running MySQL 8.0.32 with its default configuration. This is not a super high-performance test environment.

    So I suspect something else is hindering your performance: severely outdated hardware, an overloaded server, or a client application that is blocking you somehow.

    I would recommend that you double-check the load on your database server.
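
    For example, standard commands like these show what else the server is busy with (which numbers matter depends on your setup):

    -- list the other sessions and what they are currently running
    show processlist;

    -- check thread and connection pressure
    show global status like 'Threads_%';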

    Also, use query profiling to get more detail on where the query is spending its time. I know you said the query takes over an hour, but you should be able to get a query that finishes by testing on a table with fewer rows.
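
    For example, EXPLAIN ANALYZE (available since MySQL 8.0.18) executes the query and reports the actual time spent in each step of the plan:

    explain analyze
    select person, concat((sum(qa_flag)/count(*))*100, '%')
    from myTable
    group by person;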