Search code examples
mysqlindexinggroup-byquery-performance

Improving group by performance in mysql on large result sets


We have a large table, let's call it "Data", with ~2 billion rows, data is indexed by date, location, name. Each row has a 'score'.

We also have a table which contains all of the distinct dates in this table.

If I run a query like this:

SELECT  AVG(score)
    FROM  Data d
    WHERE  d.date IN (
        SELECT  today
            FROM  dates dt
            WHERE  dt.today > '2020-01-01'
              AND  dt.today <  '2020-06-01'
              AND  d.location = 'Location1');

This query returns in several seconds. If I then run the same query, but look for the average score grouped by name, the query takes minutes. i.e.

SELECT  d.name, AVG(score)
    FROM  Data d
    WHERE  d.date IN (
        SELECT  today
            FROM  dates dt
            WHERE  dt.today > '2020-01-01'
              AND  dt.today <  '2020-06-01'
              AND  d.location = 'Location1')
            GROUP BY  .d.name;

The number of distinct names is several hundred thousand, any techniques to improve the speed of a query like this?


Solution

  • First of all try this.

     CREATE INDEX data_name_score ON Data (location, date, name, score);
    

    This compound covering index should accelerate your query. For a table your size it will take some time to create. (Run it overnight, maybe?)

    Why does this index improve your query's performance? Think of the index as a sorted list of all the values in all the columns of the index.

    1. MySQL can random-access the index to find the first relevant row ... the first row with the location you chose and the date in the range you mentioned.

    2. It can then step through the index sequentially, without referring back to the table at all, to satisfy the query. name and score are in the index.

    3. As it steps through the index, lo and behold, the index items are in the ideal order to handle the GROUP BY. It encounters all the score values for the name with value a, then all the scores for b, and so forth. There's no need for an internal table with a row for each of your distinct names.

      Notice that if you said MAX(score) instead of AVG(score) your query could be satisfied by a so-called loose index scan. Those are almost miraculously fast, even faster than the tight index scan your query will use.

    Second, simplify your query like this.

    SELECT d.name, AVG(score) avgscore 
      FROM Data d
     WHERE d.location = 'Location1'
       AND d.date >= '2020-01-01'
       AND d.date < '2020-06-01'
     GROUP BY d.name;
    

    MySQL should be able to satisfy your query with a range scan on the index I suggested.

    And, beware, lots of single-column indexes are generally harmful to performance unless they are matched to actual queries you must perform. Several single-column indexes are not the equivalent of a multi-column index.

    As for why your query is slow without the index, you can use EXPLAIN to get MySQL to tell you exactly how it satisfies the query. It probably has to examine a large fraction of the gigarows in your table to filter the ones you want and generate your results.