Tags: postgresql, query-optimization

When is `my_variable IN (<long_array_of_items>)` fast in PostgreSQL?


I sometimes need to select many rows from different tables based on a list of IDs or other values. Since this list can contain thousands of elements, it is important to me that the filter does not turn into an O(n*m) operation. I have found that Postgres sometimes performs fast lookups (for example, when I filter by primary key) and sometimes not. When the filtering was slow, I typically worked around it by putting the IN-list into its own table and joining it against the target table to apply the filter, along the lines of the sketch below.
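A minimal sketch of that workaround, assuming a hypothetical table `items` with primary key `id` (all names invented for illustration): the list is turned into a relation, so the planner can hash it or join it against an index like any other table:

```sql
-- Instead of  WHERE id IN (1, 2, 3, ...)  with thousands of literals,
-- materialize the list as a derived table and join against it:
SELECT i.*
FROM items AS i
JOIN unnest(ARRAY[1, 2, 3 /* ..., thousands of ids */]::bigint[]) AS f(id)
  ON i.id = f.id;

-- Equivalent formulation using a VALUES list:
SELECT i.*
FROM items AS i
JOIN (VALUES (1), (2), (3)) AS f(id)
  ON i.id = f.id;
```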

But now I'm back at the problem again, and I would really appreciate help building some intuition for when I can assume an index lookup is used and when a hash table is used.

At my current job we use Postgres 15, but most of my experience is with Postgres 10, so I may have some bad habits from older versions.


Solution

  • v14 introduced an optimization (commit 50e17ad281b8d1c1b) to use a hash table for large IN-lists rather than doing a linear search over the IN-list for each candidate row, so since then you shouldn't get O(n*m) behavior. There may be corner cases, such as when the IN-list is computed dynamically or the types are not fully compatible. Nothing in the query plan indicates that this optimization is being applied, other than the lowered cost estimate; see the first sketch below.

    Even before v14, if the column being tested was indexed, the planner could hop on the index once for each member of the list and avoid the O(n*m) behavior that way. Behind the scenes this is a loop, but in the query plan it is presented as a single index scan with an index condition like `Index Cond: (col1 = ANY ('{...}'))`, as in the second sketch below.
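As a sketch of the v14+ hash-table case (table and column names are hypothetical), the plan still just shows a filter with `= ANY`, so the optimization is only visible through the lowered cost estimate:

```sql
-- Hypothetical table with no index on "color", so the IN-list is
-- evaluated as a filter on a sequential scan:
EXPLAIN SELECT * FROM items
WHERE color IN ('red', 'green', 'blue' /* ... thousands more */);

-- v13 and earlier: each row is compared against the list linearly -> O(n*m).
-- v14+: the list is loaded into a hash table once, each row probes it -> ~O(n).
-- In both cases the plan line looks the same, e.g.:
--   Seq Scan on items
--     Filter: (color = ANY ('{red,green,blue,...}'::text[]))
```

And a sketch of the indexed case, again with invented names: assuming `id` is the primary key (and therefore already indexed), the plan shows one index scan whose condition carries the whole array:

```sql
-- Each list member triggers one index lookup internally,
-- but the plan presents it as a single index scan:
EXPLAIN SELECT * FROM items
WHERE id IN (1, 2, 3 /* ... thousands more */);
--   Index Scan using items_pkey on items
--     Index Cond: (id = ANY ('{1,2,3,...}'::integer[]))
```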