I am new to Flink and I'm trying to understand a few things. I've got a theory which I am trying to confirm. So it goes like that:
- Parallelism refers to how many parallel "machines" (could be threads or different machines as I understand, correct me if I'm wrong) will run my job.
- Flink by default will partition the stream in a round-robin manner to take advantage of the job's parallelism.
- If the programmer defines a partitioning strategy (for example with keyBy) then this strategy will be followed instead of the default round-robin.
- If the parallelism is set to 1 then partitioning the stream will not have any effect on the processing speed as the whole stream will end up being processed by the same machine. In this case, the only benefit of partitioning a stream (with keyBy) is that the stream can be processed in keyed context.
- keyBy guarantees that the elements with the same key (same group) will be processed by the same "machine" but it doesn't mean that this machine will only process elements of this group. It could process elements from other groups as well but it processes each group as if it is the only one, independently from the others.
- Setting a parallelism of 3 while the maximum number of partitions that my partition strategy can spawn is 2, is kind of meaningless as only 2 of the 3 "machines" will end up processing the two partitions.
Can somebody tell me if those points are correct? Correct me if I'm wrong please.
Thank you in advance for your time
I think you've got it. To expand on point 6: If your job uses a keyBy
to do repartitioning, as in
source
.keyBy(...)
.window(...)
.sinkTo(...)
then in a case where the source is a Kafka topic with only 2 partitions,
the source operator will only have 2 active instances, but for the window and sink all 3 instances will have meaningful work to do (assuming there are enough distinct keys).
Also, while we don't talk about it much, there's also horizontal parallelism you can exploit. For example, in the job outlined above, the source task will run in one Java thread, and the task with the window and sink will run in another thread. (These are separate tasks because the keyBy
forces a network shuffle.) If you give each task slot enough hardware resources, then these tasks will be able to run more-or-less independently (there's a bit of coupling, since they're in the same JVM).