Search code examples
optimizationarchitectureapache-flinkflink-streaming

How to know which operators can be chained in Apache Flink


I'm reading the doc of Apache Flink: https://ci.apache.org/projects/flink/flink-docs-stable/concepts/runtime.html.

As the doc mentioned,

For distributed execution, Flink chains operator subtasks together into tasks. Each task is executed by one thread. Chaining operators together into tasks is a useful optimization: it reduces the overhead of thread-to-thread handover and buffering, and increases overall throughput while decreasing latency.

So, as my understanding, knowing which operators can be chained is important. But how could we know about it? I mean, how can we know which operators can be chained and which operators can not?

For example, in the example of WordCount,

enter image description here

When we start to code, how could we know that Source and map() can be chained, that map() and keyBy()/window()/apply() can not be chained?


Solution

  • Whenever two operators are connected by a FORWARDing data connection, they can be chained. In other words, a keyBy, rebalance, or change in parallelism (which is also a rebalance) forces network communication and makes operator chaining impossible.