Will limiting the result set in the where clause improve the performance of group by and having?

I have a MySQL table with over 25 million rows. So to prevent bringing down the entire DB, I don’t want to do any queries which will cause temp tables to get created on disk, such as an order by on an unindexed column.

So if the table has the following columns…

 employee_id
 first_name
 last_name
 hire_date
 manager_id

And let’s say I want to see those managers who have 3 or more employees working for them like so (let's assume manager_id has no index)..

select count(*), manager_id from employee group by manager_id  having count(id) > 3

Would it help the performance of this query, if I restricted the result set to only those employees hired after 2016....

select count(*), manager_id from employee where hire_date > ‘2016-01-01’ 
group by manager_id having count(id) > 3

Let's assume hire_date also has no index. Would that additional where clause help?

Solution

Yes, restricting the rows in the WHERE clause means there will be fewer rows to group, and some groups won't even show up because the rows that would be in that group will already be filtered out.

The GROUP BY will probably create a temporary table in your query. But at least it will be a smaller temp table because there will be fewer groups. The way to avoid the temp table is to make a query scan in index order by an index on manager_id.

MySQL will be able to avoid the temp table if it can be assured that it's scanning in order by manager_id, so it can count the rows in each group more easily, assuming that it's scanning each group contiguously. In other words, when it reaches the last row for a given manager_id, it knows there can be no more rows further on for that same manager_id. Therefore it doesn't need to keep a tally of the count per manager_id. It can just output each count per manager_id as it finishes scanning each group of rows.

But you might find that an index on hire_date has greater benefit. If that condition can avoid scanning most of the table by starting at rows where hire_date > '2016-01-01', then the cost of the temp table might be less than the cost of the table-scan.

There's no way to make a query that scans by an index on manager_id and also scans by an index on hire_date. Which strategy is better depends on how many rows in your table matching different conditions.