java apache-spark apache-kafka spark-streaming

Why is the "topics" argument of KafkaUtils.createStream() a Map rather than an array?

Definition in docs:

org.apache.spark.streaming.kafka

Class KafkaUtils

static JavaPairReceiverInputDStream<String,String> createStream(JavaStreamingContext jssc, String zkQuorum, String groupId, java.util.Map<String,Integer> topics)

Create an input stream that pulls messages from Kafka Brokers.

Why is topics a Map (rather than a string array)?

I understand that the string key is the topic name. But what about the integer value? What should I fill in?

Solution

Read the Javadoc:

public static JavaPairReceiverInputDStream<String,String> createStream(JavaStreamingContext jssc, String zkQuorum, String groupId, java.util.Map<String,Integer> topics)

Create an input stream that pulls messages from Kafka Brokers. Storage level of the data will be the default StorageLevel.MEMORY_AND_DISK_SER_2.

Parameters: jssc - JavaStreamingContext object

zkQuorum - Zookeeper quorum (hostname:port,hostname:port,..)

groupId - The group id for this consumer

topics - Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread

Returns: DStream of (Kafka message key, Kafka message value)

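The value in the Map is the number of consumer threads the receiver uses for that topic. The Javadoc calls it numPartitions, but it does not have to equal the topic's actual Kafka partition count, and it does not affect how the resulting RDDs are partitioned; it only controls how many threads read the topic inside the single receiver. A common choice is the topic's partition count, so that each Kafka partition is consumed in its own thread.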
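As a minimal sketch of filling in that argument: the code below builds a Map from topic name to thread count using only java.util. The topic names ("events", "logs"), the ZooKeeper addresses, and the group id are hypothetical placeholders, not values from the question. The createStream call is left as a comment, since it needs the spark-streaming-kafka artifact and an existing JavaStreamingContext (assumed here to be named jssc).

```java
import java.util.HashMap;
import java.util.Map;

public class TopicsMapExample {

    // Builds the "topics" argument: topic name -> number of consumer threads.
    // Topic names and thread counts are made up for illustration.
    static Map<String, Integer> buildTopics() {
        Map<String, Integer> topics = new HashMap<>();
        topics.put("events", 2); // read "events" with 2 threads
        topics.put("logs", 1);   // read "logs" with 1 thread
        return topics;
    }

    public static void main(String[] args) {
        Map<String, Integer> topics = buildTopics();
        System.out.println(topics);

        // With spark-streaming-kafka on the classpath and a JavaStreamingContext
        // named jssc (both assumed), the map would be passed like this:
        //
        // JavaPairReceiverInputDStream<String, String> stream =
        //     KafkaUtils.createStream(jssc, "zk1:2181,zk2:2181", "my-group", topics);
    }
}
```

Because the argument is a Map, one call can subscribe to several topics and give each its own thread count, which a plain string array could not express.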