python-3.x cassandra nosql nosql-aggregation datastax-python-driver

Query with IN operator results in NoHostAvailable, ConnectionBusy "host is overloaded"

We are trying to fetch the data from the Cassandra data base using IN query to filter the data for specific time duration as mentioned below:

"SELECT * FROM "+ DATABASES['default']['NAME'] +".perform_stats WHERE device_id IN (" + ','.join(device_id_lst) + " AND created_at <= " + created_at +" and created_at >= " start_time + ";"

But randomly observing the below error log.

cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: host:9042 datacenter1>: ConnectionBusy('Connection host:9042 is overloaded')})
   File "/opt/app-root/lib64/python3.8/site-packages/cassandra/cqlengine/query.py", line 515, in __iter__
   File "/opt/app-root/lib64/python3.8/site-packages/cassandra/cqlengine/query.py", line 472, in _execute_query
   File "/opt/app-root/lib64/python3.8/site-packages/cassandra/cqlengine/query.py", line 404, in _execute
   File "/opt/app-root/lib64/python3.8/site-packages/cassandra/cqlengine/query.py", line 1531, in _execute_statement
   File "/opt/app-root/lib64/python3.8/site-packages/cassandra/cqlengine/connection.py", line 345, in execute
   File "cassandra/cluster.py", line 2618, in cassandra.cluster.Session.execute
   File "cassandra/cluster.py", line 4877, in cassandra.cluster.ResponseFuture.result
cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: host:9042 datacenter1>: ConnectionBusy('Connection host:9042 is overloaded')})

In single query we are trying to fetch approx 10 devices data for 15 min duration and for each devices we have 4 entry for 15 min. Thus total approx 40 entries per queries. But observation is system memory cache consumption increases drastically though its heap memory is approx 6/8 GB and randomly observing the above mentioned error.

you can refer to below schema for reference

CREATE TABLE sample.perform_stats (
    device_id uuid,
    created_at bigint,
    stats_data text,
    PRIMARY KEY (device_id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

Current environments details are below:

CentOS Version: 7.9
Cassandra version: 4.1.0
Java Version: 11.0.18
python Version: 3.6.8
Django: 3.1.4
django-cassandra-engine: 1.6.1
cassandra-driver==3.24.0

We have tried have tried increasing the memory and reducing the number of devices count from 2500 to 10 devices per query. Also we tried to by replacing the django ORM query with Cassandra RAW query. But as of now there is no benefit.

Can anyone suggest what can be possible reason for this issue and way to further debugging to resolve this?

Solution

This is expected behaviour. The issue is that your query is overloading the nodes in the cluster.

When you use the IN() operator on multiple partitions, the coordinator of the query has to fire off multiple separate requests to get all the partitions. It is a scatter-gather access pattern that is very expensive which means that the client requests queue up on the coordinators until their queue fills up.

When a coordinator's queue reaches capacity (because it is waiting for other nodes to respond to multiple requests), it is no longer able to accept new requests leading to the ConnectionBusy error. It gets to a point where all the nodes have maxed out their queues leading to the NoHostAvailable exception.

The IN() operator is not designed for multi-partition filtering. Instead, it should only be used for filtering rows within a single partition (only used to filter on a clustering column).

In my experience, using the IN() operator on a partition key only works when there are 2 items to filter on, maybe 3 at most. Anything more and the performance becomes very unpredictable.

If the IN() operator is used on a partition key, it indicates to me that the data model isn't right. You should partition your table based on the application query. Cheers!