I am very new to Spark Streaming and have some basic doubts. Can someone please help me clarify the following?
My message size is standard: 1 KB per message.
The topic has 30 partitions, and I am using the DStream approach to consume messages from Kafka.
The cores given to the Spark job are:
(spark.cores.max=6 | spark.executor.cores=2)
As I understand it, the number of Kafka partitions equals the number of RDD partitions. In this case, with the DStream approach:

    dstream.foreachRDD(rdd -> {
        rdd.foreachPartition(partition -> {
            // process the records of one Kafka partition
        });
    });

**Question**: Will foreachPartition execute 30 times, since there are 30 Kafka partitions?
Also, since I have given the job 6 cores, how many partitions will be consumed from Kafka in parallel?

**Question**: Is it 6 partitions at a time, or 30/6 = 5 partitions at a time?
Can someone please give a little detail on how exactly this works in the DStream approach?
"Is it 6 partitions at a time or 30/6 =5 partitions at a time?"
As you said already, the RDDs produced by the Direct Stream will have the same number of partitions as the Kafka topic.
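You can verify this yourself by printing the partition count of each batch's RDD. Here is a minimal sketch using the Kafka 0.10 direct stream API (the broker address, topic name, and group id below are placeholders to adapt to your setup):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class PartitionCountDemo {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("partition-count-demo");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "broker:9092");   // placeholder
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "demo-group");             // placeholder

            JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                    jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(
                        Collections.singletonList("my-topic"), kafkaParams));

            // With a 30-partition topic this prints 30 on every micro-batch,
            // whether or not new data arrived.
            stream.foreachRDD(rdd ->
                System.out.println("RDD partitions: " + rdd.getNumPartitions()));

            jssc.start();
            jssc.awaitTermination();
        }
    }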
On each micro-batch, Spark creates 30 tasks, one to read each partition. Since you have set the maximum number of cores to 6, the job can read 6 partitions in parallel; as soon as one of those tasks finishes, a task for the next partition starts. So it is 6 partitions at a time, not fixed rounds of 5.
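If you want to observe the scheduling, a rough sketch (reusing the `stream` from the program above) is to log which partition each task handles and on which thread; with 6 cores you should see at most 6 of these log lines overlapping in time per batch:

    // Slots into the program above, in place of its foreachRDD call.
    stream.foreachRDD(rdd ->
        rdd.foreachPartition(records -> {
            // Runs on an executor; at most 6 of these tasks run at once
            // because the job has 6 cores in total. Output appears in the
            // executors' stdout logs, not on the driver.
            int partitionId = org.apache.spark.TaskContext.get().partitionId();
            long start = System.currentTimeMillis();
            while (records.hasNext()) {
                records.next(); // consume the record; real processing goes here
            }
            System.out.println("partition " + partitionId + " finished in "
                + (System.currentTimeMillis() - start) + " ms on "
                + Thread.currentThread().getName());
        }));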
Remember, even if there is no new data in one of the partitions, the resulting RDD still has 30 partitions, so yes, foreachPartition will run 30 times within each micro-batch.
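One way to confirm the 30 invocations per batch is a LongAccumulator (again a sketch that slots into the program above; the accumulator name is arbitrary):

    org.apache.spark.util.LongAccumulator calls =
        jssc.sparkContext().sc().longAccumulator("foreachPartition-calls");

    stream.foreachRDD(rdd -> {
        long before = calls.value();                    // driver side
        rdd.foreachPartition(records -> calls.add(1));  // once per partition, on executors
        // Prints 30 even when some partitions received no new messages.
        System.out.println("foreachPartition ran " + (calls.value() - before) + " times");
    });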