Search code examples
mysqlsql-serverpostgresqlsqlitequery-optimization

Optimization: WHERE x IN (1, 2 .., 100.000) vs INNER JOIN tmp_table USING(x)?


I've visited one interesting job interview recently. There I was asked a question about optimizing a query with a WHERE..IN clause containing long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about simple list of scalars.

I answered right away, that this can be optimized using an INNER JOIN with another table (possibly temporary one), which will contain only those scalars. My answer was accepted and there was a note from the reviewer, that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.

But when I walked out, I started to have some doubts. The condition seemed rather trivial and widely used for modern RDBMS not to be able to optimize it. So, I started some digging.

PostgreSQL:

It seems, that PostgreSQL parse scalar IN() constructions into ScalarArrayOpExpr structure, which is sorted. This structure is later used during index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop. No joins are done. So, I expect such query to be even faster, than INNER JOIN. I tried some queries on my existing database and my tests proved that position. But I didn't care about test purity and that Postgres was under Vagrant so I might be wrong.

MSSQL Server:

MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, that is a performance match, I think. I didn't do any tests since I don't have any experience with this RDBMS.

MySQL Server:

The 13th of these slides says, that before 5.0 this problem indeed took place in MySQL with some cases. But other than that, I didn't find any other problem related to bad IN () treatment. I didn't find any proofs of the inverse, unfortunately. If you did, please kick me.

SQLite:

Documentation page hints some problems, but I tend to believe things described there are really at conceptual level. No other information was found.

So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or, may be, it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other conditions. That was just abstract talk).

It looks like the days, where databases rewrote IN() as a set of OR statements (which can cause problems sometimes with NULL values in lists, btw) are long ago. Or not?

Of course, in cases where a list of scalars is longer than allowed database protocol packet, INNER JOIN might be the only solution available.

I think in some cases query parsing time (if it was not prepared) alone can kill performance.

Also databases could be unable to prepare IN(?) query which will lead to reparsing it again and again (which may kill performance). Actually, I never tried, but I think that even in such cases query parsing and planning is not huge comparing to query execution.

But other than that I do not see other problems. Well, other than the problem of just HAVING this problem. If you have queries, that contain thousands of IDs inside, something is wrong with your architecture.

Do you?


Solution

  • Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.

    Any description of optimization is definitely database specific. However, MySQL is quite specific about how it optimizes in:

    Returns 1 if expr is equal to any of the values in the IN list, else returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.

    This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.

    I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.