We have a large table, let's call it "Data", with ~2 billion rows, data is indexed by date, location, name. Each row has a 'score'.
We also have a table which contains all of the distinct dates in this table.
If I run a query like this:
SELECT AVG(score)
FROM Data d
WHERE d.date IN (
SELECT today
FROM dates dt
WHERE dt.today > '2020-01-01'
AND dt.today < '2020-06-01'
AND d.location = 'Location1');
This query returns in several seconds. If I then run the same query, but look for the average score grouped by name, the query takes minutes. i.e.
SELECT d.name, AVG(score)
FROM Data d
WHERE d.date IN (
SELECT today
FROM dates dt
WHERE dt.today > '2020-01-01'
AND dt.today < '2020-06-01'
AND d.location = 'Location1')
GROUP BY .d.name;
The number of distinct names is several hundred thousand, any techniques to improve the speed of a query like this?
First of all try this.
CREATE INDEX data_name_score ON Data (location, date, name, score);
This compound covering index should accelerate your query. For a table your size it will take some time to create. (Run it overnight, maybe?)
Why does this index improve your query's performance? Think of the index as a sorted list of all the values in all the columns of the index.
MySQL can random-access the index to find the first relevant row ... the first row with the location
you chose and the date
in the range you mentioned.
It can then step through the index sequentially, without referring back to the table at all, to satisfy the query. name
and score
are in the index.
As it steps through the index, lo and behold, the index items are in the ideal order to handle the GROUP BY
. It encounters all the score
values for the name
with value a
, then all the scores for b
, and so forth. There's no need for an internal table with a row for each of your distinct names.
Notice that if you said MAX(score)
instead of AVG(score)
your query could be satisfied by a so-called loose index scan. Those are almost miraculously fast, even faster than the tight index scan your query will use.
Second, simplify your query like this.
SELECT d.name, AVG(score) avgscore
FROM Data d
WHERE d.location = 'Location1'
AND d.date >= '2020-01-01'
AND d.date < '2020-06-01'
GROUP BY d.name;
MySQL should be able to satisfy your query with a range scan on the index I suggested.
And, beware, lots of single-column indexes are generally harmful to performance unless they are matched to actual queries you must perform. Several single-column indexes are not the equivalent of a multi-column index.
As for why your query is slow without the index, you can use EXPLAIN
to get MySQL to tell you exactly how it satisfies the query. It probably has to examine a large fraction of the gigarows in your table to filter the ones you want and generate your results.