apache-kafka, cassandra, spark-streaming, spark-cassandra-connector, apache-spark-standalone

Why does my Spark app only use one of two workers in my cluster?


I use a standalone cluster with 2 workers. My Spark Streaming job reads from Kafka and writes to Cassandra and HDFS:

val stream = KafkaUtils.createDirectStream...
stream.map(rec => Row(rec.offset, rec.value)).saveToCassandra(...)
stream.map(_.value).foreachRDD(rdd => {saving to HDFS})

I send approximately 40000 msg/sec to Kafka. The first problem is that saveToCassandra is slow: if I comment out stream.saveToCassandra, the job runs fast. In the Spark driver UI I see that writing 5 MB of output takes approximately 20 s. I tried tuning the spark-cassandra options, but it still takes at least 14 s.

The second problem is that I noticed one of my workers does nothing. In its log I see only something like this:

10:05:33 INFO remove RDD#

and so on.

But if I stop the other worker, it begins to work.

I don't use spark-submit, just

object startSpark extends App {

containing the whole code, and then I start it with

scala -cp "spark libs:kafka:startSpark.jar" startSpark

and to make the jars available to the workers I use ssc.sparkContext.addJars(pathToNeedableJars)

How can I speed up writing to Cassandra, and how can I get my workers to work together?


Solution

  • I read the official Spark Kafka integration guide carelessly. The problem is that I use only 1 partition for my topic, and the guide states:

    1:1 correspondence between Kafka partitions and Spark partitions
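
Because of the 1:1 correspondence quoted above, a topic with a single partition yields a single Spark partition per batch, so only one task (and therefore one worker) has anything to do. The sketch below is one possible workaround, assuming the spark-streaming-kafka-0-10 integration: it repartitions the stream before the expensive write. The host names, topic name, group id, batch interval, and targetPartitions are all hypothetical placeholders and are not taken from the question.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object RepartitionSketch extends App {
  // Placeholder master URL and batch interval (assumptions for illustration).
  val conf = new SparkConf()
    .setAppName("repartition-sketch")
    .setMaster("spark://master-host:7077")
  val ssc = new StreamingContext(conf, Seconds(5))

  // Placeholder Kafka settings (assumptions for illustration).
  val kafkaParams = Map[String, Object](
    "bootstrap.servers"  -> "kafka-host:9092",
    "key.deserializer"   -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id"           -> "repartition-sketch"
  )

  // With the direct stream, each Kafka partition becomes one Spark partition.
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
  )

  // Hypothetical target, e.g. roughly the total number of executor cores.
  val targetPartitions = 8

  // Map to plain values first: ConsumerRecord is not serializable,
  // so it cannot go through the shuffle that repartition triggers.
  val rows = stream
    .map(rec => (rec.offset, rec.value))
    .repartition(targetPartitions)

  // Print the partition count of each batch to confirm the work is spread out.
  rows.foreachRDD { rdd =>
    println(s"partitions in this batch: ${rdd.getNumPartitions}")
  }

  ssc.start()
  ssc.awaitTermination()
}
```

Repartitioning adds a shuffle to every batch, so it is a trade-off rather than a free fix. Increasing the number of partitions on the Kafka topic itself avoids that shuffle and lets both workers read from Kafka in parallel.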