This question also exists here: Poor whereHas performance in Laravel ... but without an answer.
The same thing happened to me as to the author of that question:
The replays table has 4M rows and the players table has 40M rows. This query uses where exists and it takes a lot of time (70s) to finish:
select * from `replays`
where exists (
    select * from `players`
    where `replays`.`id` = `players`.`replay_id`
      and `battletag_name` = 'test')
order by `id` asc
limit 100;
but when it's changed to use where id in instead of where exists, it's much faster (0.4s):
select * from `replays`
where id in (
    select replay_id from `players`
    where `battletag_name` = 'test')
order by `id` asc
limit 100;
MySQL (InnoDB) is being used.
I would like to understand why there is such a big difference in performance between where exists vs. where id in - is it because of the way MySQL works? I expected the exists variant to be faster because MySQL would only have to check whether relevant rows exist... but I was wrong (I probably don't understand how exists works in this case).
Gordon has a good answer. The fact is that performance depends on a lot of different factors, including database design/schema and the volume of data.
As a rough guide, the exists sub-query is going to execute once for every row in replays, whereas the in sub-query executes once to produce its result set, and that result set is then searched for every row in replays.
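Conceptually (a sketch of the idea above, not MySQL's literal internal plan), the in variant behaves as if the sub-query result were collected once up front, for example into a temporary table, and the outer query then only has to probe that set:

-- run the sub-query once; column names are taken from the queries above
create temporary table matching_replays as
    select distinct `replay_id`
    from `players`
    where `battletag_name` = 'test';

-- the outer query now only searches this (hopefully small) set of ids
select * from `replays`
where `id` in (select `replay_id` from matching_replays)
order by `id` asc
limit 100;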
So with exists, the better the indexing/access path, the faster it will run. Without relevant index(es) it will just read through players rows until it finds a match, and it does that for every single row in replays. For the rows with no matches it would end up reading the entire players table each time; even the rows with matches could read through a significant number of players rows before finding a match.
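As an illustration (a sketch only - the index name is made up, and the columns come from the queries above), a composite index on players would let each exists probe be answered from the index instead of by scanning rows:

-- lets MySQL answer "is there a player for this replay with this battletag_name"
-- with a single index lookup per outer row
create index players_replay_id_battletag_name_idx
    on `players` (`replay_id`, `battletag_name`);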
With in, the smaller the result set from the sub-query, the faster it will run. For rows without a match it only needs to check that small set of sub-query results to reach the answer. That said, you don't get the benefit of indexes (if it works this way), so for a large result set from the sub-query it has to read every row of that result set before deciding there is no match.
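Since the size of that result set matters, it is worth measuring it directly (a quick check, using the same literal value from the queries above):

-- how many distinct replay ids does the sub-query actually produce?
select count(distinct `replay_id`)
from `players`
where `battletag_name` = 'test';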
That said, database optimisers are pretty clever and don't always evaluate queries exactly the way you wrote them, which is why checking execution plans and testing yourself is important to figure out the best approach. It's not unusual to expect a certain execution path only to find that the optimiser has chosen a different method of execution based on how it expects the data to look.
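For example, you can put EXPLAIN in front of either query to see how MySQL plans to execute it (on recent MySQL 8.0 versions, EXPLAIN ANALYZE additionally runs the query and reports actual timings):

explain
select * from `replays`
where exists (
    select * from `players`
    where `replays`.`id` = `players`.`replay_id`
      and `battletag_name` = 'test')
order by `id` asc
limit 100;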