Search code examples
parallel-processingarchitectureapache-flinkflink-streaming

How to understand slot sharing and parallelism in Apache Flink


I'm trying to figure out slot sharing and parallelism in Flink with the example WordCount.

enter image description here

Saying that I need to do the word count job with Flink, there are only one data source and only one sink.

In this case, can I make a design just like the image above? I mean, I set two sub-tasks on Source + map() and two sub-tasks on keyBy()/window()/apply(), in other words, I have two lines: A --- B --- Sink and C --- D --- Sink so that I can get a better performance.

For example, there is a data stream coming: aaa, bbb, aaa. With the design above, I may get such a situation: aaa and bbb goes into the A --- B and the other aaa goes into the C --- D. And finally, I can get the result aaa: 2, bbb: 1 at the Sink. Am I right for now?

If I'm right, I know that subtasks of the same task cannot share a slot, so does it mean that A and C can't share a slot, B and D can't share a slot? Am I right? How do I assign the slots? Should I put A + B + Sink into one slot and put C + D into another slot?


Solution

  • Slot sharing is enabled by default. With slot sharing enabled, the number of slots required is the same as the parallelism of the task with the highest parallelism (which is two in this case).

    In this example the scheduler will put A + B + Sink into one slot, and C + D into another. This isn't something you normally need to configure or even give much thought to, as the defaults work well in most cases.

    If you were to completely disable slot sharing, then this job would require 5 slots, one for each of A, B, C, D, and the sink. But disabling slot sharing is almost never a good idea. Just make sure each slot has sufficient resources to run all of the subtasks concurrently.