scala, apache-spark, apache-kafka, spark-streaming, apache-spark-1.4

Spark + Kafka integration - mapping of Kafka partitions to RDD partitions


I have a couple of basic questions related to Spark Streaming.

[Please let me know if these questions have been answered in other posts - I couldn't find any]:

(i) In Spark Streaming, is the number of partitions in an RDD by default equal to the number of workers?

(ii) In the Direct Approach for Spark-Kafka integration, the number of RDD partitions created is equal to the number of Kafka partitions. Is it right to assume that each RDD partition i is mapped to the same worker node j in every batch of the DStream, i.e., that the mapping of a partition to a worker node is based solely on the index of the partition? For example, could partition 2 be assigned to worker 1 in one batch and worker 3 in another?
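
For concreteness, the kind of direct stream I'm asking about would be created roughly like this (Spark 1.4 with the Kafka 0.8 direct API; the broker address and topic name below are just placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("direct-example"), Seconds(10))

// "broker1:9092" and "events" are placeholders for the real broker list and topic.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))
// Each RDD produced by `stream` has one partition per Kafka partition of the topic.
```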

Thanks in advance


Solution

  • i) Default parallelism is the number of cores (or 8 on Mesos), but the number of partitions in each RDD is up to the input stream implementation.

    ii) No, the mapping of partition indexes to worker nodes is not deterministic. If you're running Kafka on the same nodes as your Spark executors, the preferred location for a partition's task is the node hosting the Kafka leader for that partition, but even then the task may be scheduled on another node. Both points can be checked per batch with the sketch below.
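
Here is a rough, self-contained way to observe both behaviours from the driver, assuming Spark 1.4 with the Kafka 0.8 direct API; the broker address and topic name are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("partition-mapping-check"), Seconds(10))

// Placeholder broker list and topic name.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

stream.foreachRDD { rdd =>
  // (i) For the direct approach this equals the topic's Kafka partition count.
  println(s"partitions in this batch: ${rdd.partitions.length}")

  // (ii) Preferred locations: the Kafka leader's host for that partition
  // (only useful if an executor runs there); the scheduler may still place the task elsewhere.
  rdd.partitions.foreach { p =>
    println(s"partition ${p.index} preferred: ${rdd.preferredLocations(p).mkString(", ")}")
  }

  // Hostname that actually executed each partition; compare across batches
  // to see that the assignment is not fixed.
  rdd.mapPartitionsWithIndex { (idx, _) =>
    Iterator(s"partition $idx ran on ${java.net.InetAddress.getLocalHost.getHostName}")
  }.collect().foreach(println)
}

ssc.start()
ssc.awaitTermination()
```

Running this over a few batches should show the partition count staying fixed (one per Kafka partition) while the host that executes a given partition index can vary.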