I have a table with ~16M records. I am trying to query the table to determine, for each unique person, what percentage of their records have a flag set. There are about 200 unique people. My query is this:
select person, concat((sum(qa_flag)/count(*))*100, '%')
from myTable
group by person;
Here qa_flag is a tinyint field containing either 1 or NULL. There is an index on the person field, but not on the qa_flag field.
This query will not return in any reasonable time; I have let it run for up to an hour. Worse, I need the query to run on demand and populate a table on a webpage with these statistics.
Can someone show me what I am doing wrong, or, if nothing is wrong per se, suggest ways of speeding this query up dramatically?
Thanks.
This query needs to do either a table-scan or an index-scan; either way, it has to visit all 16M entries.
I tested EXPLAIN for your query after creating a table with 16M rows.
mysql> explain select person, concat((sum(qa_flag)/count(*))*100, '%')
-> from myTable
-> group by person;
+----+-------------+---------+------------+-------+---------------+--------+---------+------+----------+----------+-------+
| id | select_type | table   | partitions | type  | possible_keys | key    | key_len | ref  | rows     | filtered | Extra |
+----+-------------+---------+------------+-------+---------------+--------+---------+------+----------+----------+-------+
|  1 | SIMPLE      | myTable | NULL       | index | person        | person | 129     | NULL | 16329623 |   100.00 | NULL  |
+----+-------------+---------+------------+-------+---------------+--------+---------+------+----------+----------+-------+
This shows an index-scan (type: index) with an estimated "rows" count of ~16M (in the case of an index-scan, this isn't really a count of rows, but of index leaf entries).
A possible optimization is to create a new index on both person and qa_flag, to serve as a covering index. That way the query can produce its result by reading only the index, without touching the table rows.
mysql> alter table myTable add key (person, qa_flag);
Query OK, 0 rows affected (22.11 sec)
mysql> explain select person, concat((sum(qa_flag)/count(*))*100, '%')
-> from myTable
-> group by person;
+----+-------------+---------+------------+-------+-----------------+----------+---------+------+----------+----------+-------------+
| id | select_type | table   | partitions | type  | possible_keys   | key      | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+---------+------------+-------+-----------------+----------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | myTable | NULL       | index | person,person_2 | person_2 | 131     | NULL | 16329623 |   100.00 | Using index |
+----+-------------+---------+------------+-------+-----------------+----------+---------+------+----------+----------+-------------+
This still does an index-scan over ~16M index entries, but it's an improvement, as shown by the "Using index" note in the Extra field: the query can now be answered by reading the index alone.
I tried executing the query. It finished pretty quickly considering how many index entries it had to examine:
mysql> select person, concat((sum(qa_flag)/count(*))*100, '%') from myTable group by person;
+----------------------------------+------------------------------------------+
| person                           | concat((sum(qa_flag)/count(*))*100, '%') |
+----------------------------------+------------------------------------------+
| 0000023f507999464aa2b78875b7e5d6 | 0.0000%                                  |
...
| fffffe98d0963d27015c198262d97221 | 0.0000%                                  |
+----------------------------------+------------------------------------------+
16777216 rows in set (8.64 sec)
(I generated the person values as a bunch of random hashes.)
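In case it's useful, here's roughly how such test data can be generated. This is only a sketch; the table definition is my guess at a schema matching the question, and the id column and types are assumptions:

create table myTable (
  id int unsigned auto_increment primary key,  -- assumed surrogate key
  person varchar(32) not null,                 -- holds a 32-char md5 hash
  qa_flag tinyint null,                        -- 1 or NULL, as in the question
  key (person)
);

-- seed one row, then double the table repeatedly;
-- running the second insert 24 times yields 2^24 = 16777216 rows
insert into myTable (person) values (md5(rand()));
insert into myTable (person) select md5(rand()) from myTable;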
I'm testing on a MacBook Pro laptop with an M1 CPU, running MySQL 8.0.32 with its default configuration. This is not an especially high-performance test environment.
So I guess something else is hindering your performance: severely outdated hardware, an overloaded server, or a client application that is blocking you somehow.
I would recommend that you double-check the load on your database server, and use query profiling to get more detail on where the query is spending its time. I know you said the query takes over an hour, but you should be able to get a query that finishes by testing against a table with fewer rows.
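For example, here's one way to profile, assuming you're on MySQL 8.0.18 or later: EXPLAIN ANALYZE actually executes the statement and reports where the time goes. The older SHOW PROFILE interface is deprecated but still available as an alternative.

explain analyze
select person, concat((sum(qa_flag)/count(*))*100, '%')
from myTable
group by person;

-- or, with the legacy profiler:
set profiling = 1;
select person, concat((sum(qa_flag)/count(*))*100, '%')
from myTable
group by person;
show profiles;              -- lists recent statements with their Query_ID
show profile for query 1;   -- per-stage timing for the chosen statement

Running these against a smaller copy of the table first, as suggested above, keeps the experiment manageable.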