Search code examples
cassandracqlcassandra-3.0scyllacqlengine

IN Query in Cassandra Where clause


I have a Scylla cluster with 3 Nodes and 1 Table created with the below Query

CREATE TABLE id_features (
    id int PRIMARY KEY,
    id_feature_1 int,
    id_feature_2 int,

)

I am issuing below query from the application SELECT * FROM id_features where id in (1,2,3,4...120); The query can have a maximum of 120 ids.

Will this Query contact all 3 nodes based on the token value of id`s to fetch data for 120 ids in the worst case? Or only 1 node will be contacted to fetch the data for all the ids and multiple nodes are used only for high availability

Do the replication factor, consistency level, and load balancing policy will play any role in deciding the node?


Solution

  • Will this Query contact all 3 nodes based on the token value of ids to fetch data

    Do the replication factor, consistency level, and load balancing policy will play any role in deciding the node?

    It very much depends on things like replication factor (RF), query consistency, and load balancing policy. Specifically, if RF < number of nodes, then multiple nodes will be contacted, based on the hashed token value of id and the nodes primarily assigned to those token ranges.

    But, given this statement:

    Or only 1 node will be contacted to fetch the data for all the ids and multiple nodes are used only for high availability

    ...I get the sense that RF=3 in this case.

    If the app is configured to use the (default) TokenAwarePolicy then yes, for single-key queries only, requests can be sent to the individual nodes.

    But in this case, the query is using the IN operator. Based on the 120 potential entries, the query cannot determine a single node to send the query. In that case, the TokenAwarePolicy simply acts as a pass-through for its child policy (DCAwareRoundRobinPolicy), and it will pick a node at LOCAL distance to be the "coordinator." The coordinator node will then take on the additional tasks of routing replica requests and compiling the result set.

    As to whether or not non-primary replicas are utilized in query plans, the answer is again "it depends." While the load balancing policies differ in implementation, in general all of them compute query plans which:

    • are different for each query, in order to balance the load across the cluster;
    • only contain hosts that are known to be able to process queries, i.e. neither ignored nor down;
    • favor local hosts over remote ones.

    Taken from: https://docs.datastax.com/en/developer/java-driver/3.6/manual/load_balancing/#query-plan

    So in a scenario where RF = number of nodes, a single node sometimes may be used to return all requested replicas.

    Pro-tip:

    Try not to use the IN operator with a list of 120 partition key entries. That is forcing Cassandra to perform random reads, where it really excels at sequential reads. If that's a query the app really needs to do, try:

    • Building a new table to better support that query pattern.
    • Not exceed double-digits of entries for IN.