As you know, overloaded Cassandra nodes can seriously hurt production, depending on the required consistency level: nodes may become unresponsive, the entire daemon may crash, hints may fill up your data mount point, and so on.
So the keyword here is back-pressure.
To apply appropriate back-pressure with Spark on Cassandra, these settings in particular are relevant:
--conf "spark.cassandra.output.throughputMBPerSec=2"
--total-executor-cores 24
(There are also similar back-pressure options with the DataStax driver and cqlsh.) You basically limit the throughput per core to apply some back-pressure.
Let's say I have found the global write throughput my Cassandra cluster can sustain, and I have set appropriate settings for my application1, which works fine.
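To make the per-core throttle concrete, here is a small sketch of the arithmetic, assuming the throttle applies per core (as described above) and is an integer number of MB/s. The function name aggregate_cap is hypothetical, used only for illustration.

```shell
# Rough upper bound (MB/s) on one application's write rate.
# $1 = spark.cassandra.output.throughputMBPerSec (integer, per core)
# $2 = --total-executor-cores
aggregate_cap() {
  echo $(( $1 * $2 ))
}

# Settings from the question: 2 MB/s per core across 24 cores.
aggregate_cap 2 24
```

With the settings above this gives 48 MB/s for a single application, so three applications submitted with the same settings could together push roughly three times that.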
BUT still, the challenge is that many developers share a Cassandra cluster. So at any given time, I may have Spark application1, application2, application3, ... running concurrently.
Question: What are my options to ensure that the global write throughput at any given time (no matter how many applications run concurrently) does NOT put too much pressure on Cassandra and hurt my production workload?
Thank you
What I recommend folks do to separate analytical workloads is to spin up another (logical) data center. Sure, it could be in the same physical data center. But what you want is separate compute and storage, to keep the analytics load from interfering with the production traffic.
First, make sure that you're running with the GossipingPropertyFileSnitch and that your keyspaces are using the NetworkTopologyStrategy. Likewise, you'll want to make sure that your keyspace definition contains a named data center, and that your production applications/services are configured to use that data center (ex: dc1) as their default DC.
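A minimal sketch of such a keyspace definition, run through cqlsh. The keyspace name my_keyspace and the replication factor of 3 are placeholders for illustration, not values from your cluster:

```shell
# Keyspace replicated only to the production DC (dc1).
# my_keyspace and the RF of 3 are assumed placeholders.
cqlsh -e "CREATE KEYSPACE IF NOT EXISTS my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
```

On the application side, the driver's load-balancing policy is typically where dc1 gets set as the local data center.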
Once the new infra is up, install Cassandra and join the nodes to the cluster as a new DC by specifying the new DC name in the snitch's properties file on each new node.
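As a sketch, assuming the file in question is cassandra-rackdc.properties (the file the GossipingPropertyFileSnitch reads) and that it lives under /etc/cassandra, which varies by install method. The rack name rack1 is also a placeholder:

```shell
# Run on each NEW node before it starts for the first time.
# The path and the rack name are assumptions; adjust to your install.
cat > /etc/cassandra/cassandra-rackdc.properties <<'EOF'
dc=dc1_analytics
rack=rack1
EOF
```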
Next adjust your keyspace(s) to replicate data to the new DC.
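For example, continuing with a hypothetical my_keyspace (the replication factors are placeholders; the analytics DC does not have to match production):

```shell
# Add the analytics DC to the keyspace's replication settings.
cqlsh -e "ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc1_analytics': 3};"
```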
Run a repair/rebuild on the new DC, and then configure the Spark jobs to only use dc1_analytics.
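A sketch of those last two steps. The local DC property shown is the name used by Spark Cassandra Connector 3.x; as far as I know, older 2.x releases call it spark.cassandra.connection.local_dc, so check the reference for your version. The jar name my_job.jar is hypothetical.

```shell
# On each node in the new DC: stream existing data from the production DC.
nodetool rebuild -- dc1

# Submit the Spark job so that it only talks to the analytics DC.
# spark.cassandra.connection.host should also point at dc1_analytics nodes.
spark-submit \
  --conf "spark.cassandra.connection.localDC=dc1_analytics" \
  --conf "spark.cassandra.output.throughputMBPerSec=2" \
  --total-executor-cores 24 \
  my_job.jar
```

The per-core throttle is still worth keeping, since the analytics nodes can be overloaded too. An overloaded dc1_analytics no longer takes the production DC down with it, though.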