In Spark Structured Streaming, the Kafka source has a minPartitions option. Here's what it's for, according to the documentation:
> Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.
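For context, setting that option looks like this (a minimal sketch; the broker address, topic name, and app name are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("kafka-min-partitions") // hypothetical app name
        .getOrCreate();

// With e.g. 10 topic partitions and minPartitions=40, Spark aims to split
// each Kafka partition's offset range into roughly 4 Spark partitions.
Dataset<Row> df = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092") // placeholder
        .option("subscribe", "events")                    // placeholder topic
        .option("minPartitions", "40")
        .load();
```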
Does Flink feature anything similar? I read that there's the KafkaPartitionSplit, but how do you use that?
The Flink documentation also says this:
> **Source Split**
>
> A source split in Kafka source represents a partition of Kafka topic. A Kafka source split consists of:
>
> - the TopicPartition the split represents
> - the starting offset of the partition
> - **the stopping offset of the partition, only available when the source is running in bounded mode**
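From what I can tell, KafkaPartitionSplit objects are created internally by the source's enumerator rather than by user code, and the stopping offset only comes into play when the source is built in bounded mode, something like this (a sketch; broker, topic, and group id are placeholders):

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")   // placeholder
        .setTopics("events")                  // placeholder topic
        .setGroupId("backfill")               // placeholder group id
        .setStartingOffsets(OffsetsInitializer.earliest())
        // Bounded mode: each KafkaPartitionSplit gets a stopping offset,
        // here the latest offset at the time the split is created.
        .setBounded(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();
```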
Does splitting the source like this only make sense when processing historical data?
Your Kafka source in the Flink workflow has a parallelism, which defines how many sub-tasks (each running in a separate slot) will be used to read from your Kafka topic. Flink hashes the topic name to pick the first sub-task that's assigned partition 0, and then continues (in a round-robin manner) to assign partitions to sub-tasks.
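In code terms, the owner sub-task for each partition is picked roughly like this (my paraphrase of the connector's enumerator logic, not the actual Flink source):

```java
import org.apache.kafka.common.TopicPartition;

// Paraphrase of the assignment described above: hash the topic name to get
// a starting sub-task index, then hand out consecutive partitions
// round-robin from that index.
static int ownerSubtask(TopicPartition tp, int parallelism) {
    int startIndex = ((tp.topic().hashCode() * 31) & 0x7FFFFFFF) % parallelism;
    return (startIndex + tp.partition()) % parallelism;
}
```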
So you typically want to set your source parallelism such that the number of partitions is a multiple of the parallelism. For example, if you have 40 partitions, set your parallelism to 40, 20, or 10, so that each sub-task processes the same number of partitions (mitigating data skew issues). Note that, unlike Spark's minPartitions, Flink never subdivides a single Kafka partition across sub-tasks, so a parallelism higher than the partition count just leaves the extra sub-tasks idle.
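For example, with 40 partitions and a KafkaSource<String> like the one sketched in the question (the source name string here is arbitrary):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// 40 topic partitions / parallelism 20 = exactly 2 partitions per sub-task.
DataStream<String> stream = env
        .fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
        .setParallelism(20);
```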