Search code examples
sqlamazon-redshiftin-clause

Using IN Condition with large number of items


I have local data with id's from an external database. I want to then select data from that external database using those id so I do something like:

SELECT * FROM table WHERE id IN (:listofids)

However I've noticed (I'm not sure if it's a DB by DB setting or the type of DB or what) there can be a max number of values allowed in that IN statement. When possible of course I do WHERE id IN (SELECT id FROM ...) but sometimes with external data it's not possible.

My questions:

  1. What are my options here besides what I am doing in the above?
  2. Are the limitations of max items set at the DB level, the type of DB, what? Right now I am specifically concerned with an external RedShift DB but I looked at their docs and they don't mention a limit. Our local DB has a limit of 2100 but another one I use is 9999. Also, once the list gets large it can get slow, so i'm looking for a performance boost as well.
  3. Should I do in () OR in () OR in(). That doesn't seem too slick. Are all of these options viable?
  4. I've read a bit about possibly using temp tables to do this but without any examples. How would I load data into a temp table to then join into the table I'm wanting to get the data from?

Solution

  • What are my options here besides what I am doing in the above?

    They are limited - you could chain a LONG sequence of OR id = X calls but you may have SIGNIFICANT performance problems compared to IN.

    The best option performance-wise is to join to a table (parameter, temporary, CTE, inline UNION, or static) on the server side. However that's not always an option depending on how you're executing queries and what permissions you have.

    Of course, you can also execute a separate query for each value, which would be easier to code but might have performance problems as well.

    Are the limitations of max items set at the DB level, the type of DB, what?

    That is platform-specific so there's not a universal answer

    Should I do in () OR in () OR in(). That doesn't seem too slick. Are all of these options viable?

    That is one option to add to the others above - you'd have to try it to see what the performance impact is.

    I've read a bit about possibly using temp tables to do this but without any examples. How would I load data into a temp table to then join into the table I'm wanting to get the data from?

    That depends on the platform, your permissions, your API available, etc.