
Cassandra/Scylla Cartesian product in a simple WHERE IN query?


I have a simple table, group(id varchar(64) primary key, size bigint), and I need to check whether a set of ids is present in the table. I use a simple query:

SELECT id FROM group WHERE id IN (<id1>, ..., <idn>);
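For reference, the table in proper CQL would look roughly like this (a sketch: CQL's varchar is an alias for text and takes no length argument, and the name group is quoted in case it collides with the GROUP BY keyword):

-- Sketch of the schema described above; id is the sole partition key.
CREATE TABLE "group" (
    id   varchar PRIMARY KEY,
    size bigint
);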

The list of ids can be pretty long, and when I run the query I get the following error:

java.sql.SQLTransientException: com.datastax.oss.driver.api.core.servererrors.ServerError: clustering-key cartesian product size 107 is greater than maximum 100

Unlike this question, I'd like to understand why a Cartesian product is needed at all when a lookup in the index would be enough and would not require the product, which I agree can grow to an unmanageable size.


Solution

  • The error message (which I assume you got from ScyllaDB?) is misleading:

    Both Scylla and Cassandra want to limit the number of separate row reads that a single query can turn into. For example, imagine that you have WHERE a IN (1,2,3,4,5) AND b IN (1,2,3,4,5): this can result in 5x5, i.e., 25 separate row requests (see the first sketch after this answer). This sort of "Cartesian product" can quickly get huge even for a short-looking query, which is why it is limited by two configuration options, max_partition_key_restrictions_per_query and max_clustering_key_restrictions_per_query. Both of these options default in ScyllaDB to 100.

    Although exceeding this limit most obviously happens with a Cartesian product of several "IN" clauses, as you saw it can also happen with just a single "IN" whose list is longer than the configured limit (100). In that case, the error message is indeed misleading.

    Although the fact that max_clustering_key_restrictions_per_query defaults to 100 has dubious merit (see https://github.com/scylladb/scylladb/issues/10577), in your case you are querying 107 different partitions, so the option causing your problem is max_partition_key_restrictions_per_query, which also defaults to 100. In your use case you can safely increase this option and allow bigger IN lists; a configuration sketch follows below.

    Just don't overdo max_partition_key_restrictions_per_query: the code that reads a huge number of random partitions is inefficient, and there is usually a better way to do this. The most efficient way is to store related rows in one partition and then read the entire partition, instead of indexing 1,000 different rows by key (see the last sketch below). This may, of course, require you to change your data model, and I don't know whether or how that fits your use case.
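    To make the multiplication concrete, here is a hypothetical table and query (all names invented for illustration) in which a short WHERE clause expands into 25 separate row reads:

    -- Hypothetical table: a is the partition key, b the clustering key.
    CREATE TABLE events (
        a int,
        b int,
        payload text,
        PRIMARY KEY (a, b)
    );

    -- Every (a, b) combination must be fetched separately:
    -- 5 partition keys x 5 clustering keys = 25 row requests.
    SELECT payload FROM events
    WHERE a IN (1, 2, 3, 4, 5) AND b IN (1, 2, 3, 4, 5);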
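    If raising the limit is appropriate for your workload, it is a one-line change in scylla.yaml; the value below is only an example, since both options default to 100:

    # scylla.yaml -- example values only
    max_partition_key_restrictions_per_query: 200
    # Raise this too only if you also use long IN lists on clustering keys:
    # max_clustering_key_restrictions_per_query: 200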
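    Finally, a sketch of that single-partition alternative (table and column names are hypothetical): related ids share one partition key, so a single query reads them all without any IN list:

    -- Hypothetical data model: all members of one group live in a single
    -- partition keyed by group_id; member_id is a clustering column.
    CREATE TABLE group_members (
        group_id  text,
        member_id text,
        PRIMARY KEY (group_id, member_id)
    );

    -- One single-partition read replaces an IN list spanning many partitions.
    SELECT member_id FROM group_members WHERE group_id = 'g1';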