This question also exists here: Poor whereHas performance in Laravel ... but without an answer.
The same thing happened to me as to the author of that question:
The replays table has 4M rows and the players table has 40M rows. This query uses where exists and it takes a lot of time (70s) to finish:
select * from `replays`
where exists (
    select * from `players`
    where `replays`.`id` = `players`.`replay_id`
      and `battletag_name` = 'test')
order by `id` asc
limit 100;
but when it's changed to use where id in instead of where exists, it's much faster (0.4s):
select * from `replays`
where id in (
    select replay_id from `players`
    where `battletag_name` = 'test')
order by `id` asc
limit 100;
MySQL (InnoDB) is being used.
I would like to understand why there is such a big difference in performance between where exists vs. where id in - is it because of the way MySQL works? I expected the exists variant to be faster because MySQL would only have to check whether relevant rows exist... but I was wrong (I probably don't understand how exists works in this case).
Gordon has a good answer. The fact is that performance depends on a lot of different factors, including database design/schema and the volume of data.
As a rough guide, the exists sub-query is going to execute once for every row in replays, whereas the in sub-query executes once to produce its result set, and that result set is then searched for every row in replays.
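Conceptually (a sketch of the idea above, not MySQL's literal internal plan), the in variant behaves as if the sub-query result were collected once up front, for example into a temporary table, and the outer query then only has to probe that set:

-- run the sub-query once; column names are taken from the queries above
create temporary table matching_replays as
    select distinct `replay_id`
    from `players`
    where `battletag_name` = 'test';

-- the outer query now only searches this (hopefully small) set of ids
select * from `replays`
where `id` in (select `replay_id` from matching_replays)
order by `id` asc
limit 100;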
So with exists, the better the indexing/access path, the faster it will run. Without relevant index(es) it will just read through players rows until it finds a match, and it does that for every single row in replays. For the rows with no matches it would end up reading the entire players table each time; even the rows with matches could read through a significant number of players rows before finding a match.
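As an illustration (a sketch only - the index name is made up, and the columns come from the queries above), a composite index on players would let each exists probe be answered from the index instead of by scanning rows:

-- lets MySQL answer "is there a player for this replay with this battletag_name"
-- with a single index lookup per outer row
create index players_replay_id_battletag_name_idx
    on `players` (`replay_id`, `battletag_name`);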
With in, the smaller the result set from the sub-query, the faster it will run. For rows without a match it only needs to check that small set of sub-query results to reach the answer. That said, you don't get the benefit of indexes (if it works this way), so for a large result set from the sub-query it has to read every row of that result set before deciding there is no match.
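Since the size of that result set matters, it is worth measuring it directly (a quick check, using the same literal value from the queries above):

-- how many distinct replay ids does the sub-query actually produce?
select count(distinct `replay_id`)
from `players`
where `battletag_name` = 'test';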
That said, database optimisers are pretty clever and don't always evaluate queries exactly the way you wrote them, which is why checking execution plans and testing yourself is important to figure out the best approach. It's not unusual to expect a certain execution path only to find that the optimiser has chosen a different method of execution based on how it expects the data to look.
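For example, you can put EXPLAIN in front of either query to see how MySQL plans to execute it (on recent MySQL 8.0 versions, EXPLAIN ANALYZE additionally runs the query and reports actual timings):

explain
select * from `replays`
where exists (
    select * from `players`
    where `replays`.`id` = `players`.`replay_id`
      and `battletag_name` = 'test')
order by `id` asc
limit 100;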