We are using the DataStax spark-cassandra-connector to write to a Cassandra cluster that is deployed separately from the Spark cluster.
We have observed that for bulk loads (~500M records) the write runs for about an hour, and read performance degrades while the write is in progress. While the write performance is pretty good, the read degradation is unacceptable in our environment, as some read requests are critical and must always be answered within a specific time frame.
I read an article on the SSTable Loader use case, which appears to solve the same issue by using SSTableLoader (CassandraBulkLoader).
I also read a few SO questions like this one, mentioning that writes can be really slow with SSTableLoader compared to the spark-cassandra-connector.
Now, what is the underlying reason that makes the spark-cassandra-connector faster but causes high read latency during a bulk load? Also, are there any drawbacks to SSTableLoader other than being slow?
This is normal: if you're writing data as fast as possible, it creates load on the disk system, and your reads become slower. Besides just writing the data to disk, you need to take into account the additional load on the IO system from things like compaction. It's also possible that your compaction throughput isn't very good, so compaction may be lagging behind, and this may lead to additional read latency because you have too many SSTable files.
You don't necessarily need to use sstableloader for data loading. You can just tune the write parameters so that Spark won't overload your nodes. This may include, for example, the following parameters:
- spark.cassandra.output.concurrent.writes: decrease it to 2 or 3 instead of the default 5. This will increase the load time, but should decrease the load on the servers.
- spark.cassandra.output.throughputMBPerSec: limits the write throughput, but I would suggest starting with the previous option.

Another option for bulk loading of the data could be DataStax's DSBulk, which can load data from CSV and JSON files. By default it also tries to load the data as fast as possible, but it has options for controlling throughput.
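
As a rough sketch, the connector settings named above could be applied when building the Spark session. The host name, input path, keyspace, and table below are hypothetical placeholders, and the values are starting points to experiment with rather than recommendations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object ThrottledCassandraWrite {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("throttled-cassandra-write")
      // Hypothetical contact point for the Cassandra cluster.
      .set("spark.cassandra.connection.host", "cassandra-host")
      // Fewer concurrent batches per task (default is 5): slower load, less pressure on the nodes.
      .set("spark.cassandra.output.concurrent.writes", "2")
      // Optional second knob; the value here is an assumption, try it only if the first is not enough.
      // .set("spark.cassandra.output.throughputMBPerSec", "5")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    // Hypothetical source data and target table.
    val df = spark.read.parquet("/path/to/input")
    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .mode("append")
      .save()

    spark.stop()
  }
}
```

The same properties can also be passed with --conf on spark-submit, which makes it easier to try different values without rebuilding the job.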