After few successfully ingested data into Cassandra with Spark,
an error is now returned every time I try to ingest data with Spark (after few minutes or instantly) :
Caused by: com.datastax.oss.driver.api.core.AllNodesFailedException: Could not reach any contact point, make sure you've provided valid addresses
I checked with simple CQLSH (not Spark), and similar error is indeed returned too (2 nodes of 4) :
Connection error: ('Unable to connect to any servers', {'1.2.3.4': error(111, "Tried connecting to [('1.2.3.4', 9042)]. Last error: Connection refused")})
So basically, when I do ingestion into Cassandra with Spark, some nodes go down at some point. And I have to reboot the node, in order to access it again through cqlsh (and spark).
What is strange, is that it is still written "UP" for the given node when I run nodetool status
, while cqlsh
tells connection refused
for that node.
I try to investigate logs, but I have a big problem : nothing in the logs, no single exception triggered server-side.
What to do in my case ? Why a node go down or become unresponsive in that case ? How to prevent it ?
Thanks
!!! edit !!!
Some of the details asked for, bellow :
Cassandra infrastructure :
datacenter1
and datacenter2
CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '2', 'datacenter2': '2'} AND durable_writes = true;
LOCAL_QUORUM
Spark application config :
spark.sql.catalog.cassandra.spark.cassandra.output.batch.size.rows 1
spark.sql.catalog.cassandra.spark.cassandra.output.concurrent.writes 100
spark.sql.catalog.cassandra.spark.cassandra.output.batch.grouping.key none
spark.sql.catalog.cassandra.spark.cassandra.output.throughputMBPerSec 80
spark.sql.catalog.cassandra.spark.cassandra.output.consistency.level LOCAL_QUORUM
spark.sql.catalog.cassandra.spark.cassandra.output.metrics false
spark.sql.catalog.cassandra.spark.cassandra.connection.timeoutMS 90000
spark.sql.catalog.cassandra.spark.cassandra.query.retry.count 10
spark.sql.catalog.cassandra com.datastax.spark.connector.datasource.CassandraCatalog
spark.sql.extensions com.datastax.spark.connector.CassandraSparkExtensions
(2 nodes of 4)
Just curious, but what is the replication factor (RF) of the keyspace, and what consistency level is being used for the write operation?
I'll echo Alex, and say that usually this happens because Spark is writing faster than Cassandra can process. That leaves you with two options:
One thing worth calling out:
2 replicas per datacenter
consistency used for input and output : LOCAL_QUORUM
So you'll probably get more throughput by dropping the write consistency to LOCAL_ONE
.
Remember, quorum == RF / 2 + 1, which means LOCAL_QUORUM
of 2 is 2.
So I do recommend dropping to LOCAL_ONE
, because right now Spark is effectively operating @ ALL
consistency.
Which JMX indicators I need to care about ?
Can't remember the exact name of it, but if you can find the metric for disk IOPs or throughput, I wonder if it's hitting a threshold and plateauing.