Tags: apache-spark, cassandra, apache-spark-sql, datastax-java-driver, spark-cassandra-connector

Getting BusyPoolException from com.datastax.spark.connector.writer.QueryExecutor - what am I doing wrong?


I am using spark-sql-2.4.1 and spark-cassandra-connector_2.11-2.4.1 with Java 8 and Apache Cassandra 3.0.

My spark-submit / Spark cluster environment is set up as below to load 2 billion records:

--executor-cores 3 
--executor-memory 9g 
--num-executors 5 
--driver-cores 2 
--driver-memory 4g 

I use the following configuration:

cassandra.concurrent.writes=1500
cassandra.output.batch.size.rows=10
cassandra.output.batch.size.bytes=2048
cassandra.output.batch.grouping.key=partition 
cassandra.output.consistency.level=LOCAL_QUORUM
cassandra.output.batch.grouping.buffer.size=3000
cassandra.output.throughput_mb_per_sec=128

The job takes around 2 hours, which is a really long time.

When I check the logs I see:

WARN com.datastax.spark.connector.writer.QueryExecutor - BusyPoolException

How do I fix this?


Solution

  • Your value for cassandra.concurrent.writes is incorrect - it means that you're sending 1500 concurrent batches at the same time, but by default the Java driver allows only 1024 simultaneous requests. Setting this parameter too high can overload the nodes and, as a result, cause retries for tasks.

    Also, other settings are inconsistent - if you specify cassandra.output.batch.size.rows, its value overrides the value of cassandra.output.batch.size.bytes. See the corresponding section of the Spark Cassandra Connector reference for more details.

    One aspect of performance tuning is having the correct number of Spark partitions so that you achieve good parallelism - but this really depends on your code, how many nodes are in the Cassandra cluster, etc.

    P.S. Also, please note that configuration parameters must start with spark.cassandra., not plain cassandra. - if you specified them in the latter form, these parameters are ignored and the defaults are used. A minimal sketch with the corrected settings follows below.
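
    For example, here is a minimal Java sketch (the host, keyspace, table, and input path are hypothetical) that passes the settings with the correct spark.cassandra. prefix and leaves concurrent writes at the connector's default of 5:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CassandraBulkLoad {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("cassandra-bulk-load")
                // hypothetical contact point
                .config("spark.cassandra.connection.host", "cassandra-host")
                // connector default; stays well below the driver's request limit
                .config("spark.cassandra.output.concurrent.writes", "5")
                .config("spark.cassandra.output.batch.grouping.key", "partition")
                // size batches by bytes only; setting batch.size.rows would override this
                .config("spark.cassandra.output.batch.size.bytes", "2048")
                .config("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
                .getOrCreate();

            // hypothetical source of the records to load
            Dataset<Row> df = spark.read().parquet("/path/to/input");

            df.write()
              .format("org.apache.spark.sql.cassandra")
              .option("keyspace", "my_keyspace")   // hypothetical keyspace
              .option("table", "my_table")         // hypothetical table
              .mode(SaveMode.Append)
              .save();

            spark.stop();
        }
    }

    Starting from the default of 5 concurrent writes and increasing gradually while watching for BusyPoolException warnings is a safer approach than starting at 1500.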