Search code examples
apache-flinkflink-streaming

Flink rebalance and chaining strategy


Background

We're using dataStream.rebalance() to create equal load on our partitions. However, we usually set the chaining strategy, to HEAD for example, to allow multithread transformations allocation.

Question

Is setting the chaining strategy just before rebalancing the recommended practice or does Flink automatically allow multithread transformations allocation after rebalance?


Solution

  • As a user, you usually never set the chaining strategy. You only set it if you have custom operators. In fact, we are currently deprecating chaining strategy on operator level and only allow it on operator factory level.

    By default, all operators are ALWAYS chainable. That means, as long as they share the same slot and are connected with a forward channel, the network/local channel is skipped and records are directly handed to the next transformation. Consequentially, no operators can be chained if they are connected through any kind of shuffle connections (e.g., rebalance).

    So without changing any chaining strategy, you will get long pipelines that are separated by any shuffle operation.

    Now if you change it to HEAD for all operators directly following a shuffle operation, you will actually have a no-op. Head means that the operator can only be the head of an operator chain. If you also change the strategy of all transformations following the shuffle, you will actually get no chain at all. (That point was not entirely clear from your question)

    TL;DR don't change chaining strategy unless you implement your own operator. It won't be faster than the default.