I have a table with ~16M records. I am trying to query the table to determine, for each unique person, what percentage of their records have a flag set. There are about 200 unique people. My query is this:
select person, concat((sum(qa_flag)/count(*))*100, '%')
from myTable
group by person;
Here qa_flag is a tinyint field containing either 1 or NULL. There is an index on the person field, but not on the qa_flag field.
This query will not return in any reasonable time; I have let it run for up to an hour. Worse, I need the query to run on demand and populate a table on a webpage with these statistics.
Can someone show me what I am doing wrong, or, if nothing is wrong per se, suggest ways of speeding this query up dramatically?
Thanks.
This query needs to do either a table-scan or an index-scan; either way, it has to visit all 16M entries.
I tested EXPLAIN for your query after creating a table with 16M rows.
mysql> explain select person, concat((sum(qa_flag)/count(*))*100, '%')
-> from myTable
-> group by person;
+----+-------------+---------+------------+-------+---------------+--------+---------+------+----------+----------+-------+
| id | select_type | table   | partitions | type  | possible_keys | key    | key_len | ref  | rows     | filtered | Extra |
+----+-------------+---------+------------+-------+---------------+--------+---------+------+----------+----------+-------+
|  1 | SIMPLE      | myTable | NULL       | index | person        | person | 129     | NULL | 16329623 |   100.00 | NULL  |
+----+-------------+---------+------------+-------+---------------+--------+---------+------+----------+----------+-------+
This shows an index-scan (type: index) with an estimated "rows" count of ~16M (in the case of an index-scan, this isn't really a count of rows, but of index leaf entries).
A possible optimization is to create a new index on both person and qa_flag, to serve as a covering index. That way the query can produce its result by reading only the index, without touching the table rows.
mysql> alter table myTable add key (person, qa_flag);
Query OK, 0 rows affected (22.11 sec)
mysql> explain select person, concat((sum(qa_flag)/count(*))*100, '%')
-> from myTable
-> group by person;
+----+-------------+---------+------------+-------+-----------------+----------+---------+------+----------+----------+-------------+
| id | select_type | table   | partitions | type  | possible_keys   | key      | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+---------+------------+-------+-----------------+----------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | myTable | NULL       | index | person,person_2 | person_2 | 131     | NULL | 16329623 |   100.00 | Using index |
+----+-------------+---------+------------+-------+-----------------+----------+---------+------+----------+----------+-------------+
This still does an index-scan over ~16M index entries, but it's an improvement, as shown by the "Using index" note in the Extra field: the query can now be answered by reading the index alone.
I tried executing the query. It finished pretty quickly considering how many index entries it had to examine:
mysql> select person, concat((sum(qa_flag)/count(*))*100, '%') from myTable group by person;
+----------------------------------+------------------------------------------+
| person                           | concat((sum(qa_flag)/count(*))*100, '%') |
+----------------------------------+------------------------------------------+
| 0000023f507999464aa2b78875b7e5d6 | 0.0000%                                  |
...
| fffffe98d0963d27015c198262d97221 | 0.0000%                                  |
+----------------------------------+------------------------------------------+
16777216 rows in set (8.64 sec)
(I generated the person values as a bunch of random hashes.)
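In case it's useful, here's roughly how such test data can be generated. This is only a sketch; the table definition is my guess at a schema matching the question, and the id column and types are assumptions:

create table myTable (
  id int unsigned auto_increment primary key,  -- assumed surrogate key
  person varchar(32) not null,                 -- holds a 32-char md5 hash
  qa_flag tinyint null,                        -- 1 or NULL, as in the question
  key (person)
);

-- seed one row, then double the table repeatedly;
-- running the second insert 24 times yields 2^24 = 16777216 rows
insert into myTable (person) values (md5(rand()));
insert into myTable (person) select md5(rand()) from myTable;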
I'm testing on a MacBook Pro laptop with an M1 CPU, running MySQL 8.0.32 with its default configuration. This is not an especially high-performance test environment.
So I guess something else is hindering your performance: severely outdated hardware, an overloaded server, or a client application that is blocking you somehow.
I would recommend that you double-check the load on your database server, and use query profiling to get more detail on where the query is spending its time. I know you said the query takes over an hour, but you should be able to get a query that finishes by testing against a table with fewer rows.
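For example, here's one way to profile, assuming you're on MySQL 8.0.18 or later: EXPLAIN ANALYZE actually executes the statement and reports where the time goes. The older SHOW PROFILE interface is deprecated but still available as an alternative.

explain analyze
select person, concat((sum(qa_flag)/count(*))*100, '%')
from myTable
group by person;

-- or, with the legacy profiler:
set profiling = 1;
select person, concat((sum(qa_flag)/count(*))*100, '%')
from myTable
group by person;
show profiles;              -- lists recent statements with their Query_ID
show profile for query 1;   -- per-stage timing for the chosen statement

Running these against a smaller copy of the table first, as suggested above, keeps the experiment manageable.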