Search code examples
mysqlsql

Optimizing summing/grouping query with millions of rows on MYSQL


I have a MySQL table with nearly 4.000.000 rows containing income transactions of more than 100.000 employees.

There are three columns relevant in it, which are:

  • Employee ID [VARCHAR and INDEX] (not unique since one employee gets more than one income);
  • Type of Income [also VARCHAR and INDEX]
  • Value of the Income [Decimal; 10,2]

What I was looking to do seems to be very simple to me. I wanted to sum all the income occurrences grouping by each employee, filtering by one type. For that, I was using the following code:

SELECT 
    SUM(`value`) AS `SumofValue`,
    `type`,
    `EmployeeID`
FROM
    `Revenue`
GROUP BY `EmployeeID`
HAVING `type` = 'X'

And the result was supposed to be something like this:

  SUM        TYPE     EMPLOYEE ID
 R$ 250,00  X   250000008377
 R$ 5.000,00    X   250000004321
 R$ 3.200,00    X   250000005432
 R$ 1.600,00    X   250000008765
....

However, this is taking a long time. I decide to use the LIMIT command to limit the results just to 1.000 rows and it is working, but if i want to do for the whole table, it would take approximately 1 hous according to my projections. This seems to be way too much time for something that does not look sooooo demandable to me (but i'm assuming i'm probably wrong). Not only that, but this is just the first step on an even more complex query that i intend to run in the future, in which i will group also by Employer ID, aside with Employee ID (one person can get income from more than one employer).

Is there any way to optimize this? Is there anything wrong with my code? Is there any secret path to increase the speed of this operation? Should I index the column of the value of the income as well? If this is a MySQL limitation, is there any option that could handle this better? I would really appreaciate any help.

Thanks in advance

DISCLOSURE: This is a open government database. All this data is lawfully open to the public.


Solution

  • First, phrase the query using WHERE, rather than HAVING -- filter before doing the aggregation:

    SELECT SUM(`value`) AS `SumofValue`,
           MAX(type) as type,
           EmployeeID
    FROM Revenue r
    WHERE `type` = 'X'
    GROUP BY EmployeeID;
    

    Next, try using this index: (type, EmployeeId, value). At the very least, this is a covering index for the query. MySQL (depending on the version) might be smart enough to use it for the aggregation as well.