I'm running a Spark Streaming application that reads data from Kafka, and I have enabled checkpointing so the job can recover from failures.
The problem is that when the application fails and restarts, it tries to process all the data accumulated since the point of failure in a single micro-batch. So if a micro-batch usually receives 10,000 events from Kafka and the job restarts 10 minutes after failing, it has to process one micro-batch of 100,000 events.
As a result, for recovery from the checkpoint to succeed, I have to assign much more memory than I normally would.
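For context, the job is essentially shaped like this (a trimmed-down sketch, not my exact code; the topic name, broker, checkpoint directory, group id and batch interval are placeholders, and I'm assuming the Kafka 0-10 direct stream here):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object MyStreamingJob {
  val checkpointDir = "hdfs:///checkpoints/my-streaming-job"  // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("my-streaming-job")
    val ssc = new StreamingContext(conf, Seconds(60))  // placeholder batch interval
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka:9092",  // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-streaming-job",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
    )

    // Stand-in for the real processing
    stream.map(_.value()).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuild the context from the checkpoint if one exists, otherwise create it fresh
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}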
Is it normal that, on restart, Spark Streaming tries to process all the backlogged events from the checkpoint at once, or am I doing something wrong?
Many thanks.
If your application struggles to process all of the backlogged events in one micro-batch after recovering from a failure, you can set the spark.streaming.kafka.maxRatePerPartition
configuration in your Spark conf, either in spark-defaults.conf or inside your application.
i.e. if you believe your system/app can safely handle 10,000 events per second, and your Kafka topic has 2 partitions, add this line to spark-defaults.conf:
spark.streaming.kafka.maxRatePerPartition 5000
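Assuming, for illustration, a 60-second batch interval (use your actual interval), that setting caps each micro-batch, including the first one after a restart, at roughly 5,000 records/sec × 2 partitions × 60 sec = 600,000 records, rather than pulling the entire backlog into a single batch.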
or set it inside your code:
import org.apache.spark.SparkConf

val conf = new SparkConf()
conf.set("spark.streaming.kafka.maxRatePerPartition", "5000")
Additionally, I suggest you set this number a little higher than that and enable backpressure. This will try to stream data at a rate that doesn't destabilize your streaming app.
conf.set("spark.streaming.backpressure.enabled","true")
Update: to be clear, the rate in this configuration is events per second per partition, not per minute.