Tags: apache-flink, flink-streaming

Differences between parallelism and multiple apps in Flink


I am planning to dynamically scale a Flink app up and down. The app consumes events from Kafka using the Flink Kafka connector.
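For context, here is a minimal sketch of such a consumer using the KafkaSource builder from the Flink Kafka connector; the broker address, topic, and group id are placeholder values, not taken from the question:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaConsumerJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder broker, topic, and consumer group -- assumptions for illustration.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("events")
                .setGroupId("my-consumer-group")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Attach the Kafka source and a trivial sink so the job is runnable.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .print();

        env.execute("kafka-consumer-job");
    }
}
```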

Since the "warm-up" of the app takes a few minutes (caching, ...) and changing the parallelism level requires a restart, I would prefer to submit additional jobs (scale up) or kill running jobs (scale down) instead of changing the parallelism level.

I wonder whether, in terms of performance, logic, and execution plan, there are any differences between this approach and Flink's built-in parallel execution.

In other words: what are the differences between running 10 identical Flink jobs and running one job with parallelism level 10 (env.setParallelism(10))? A sketch of the single-job setup follows.
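To make the comparison concrete, here is a hedged sketch of both setups; the pipeline itself (a number sequence plus a map) is just a stand-in for the real Kafka job:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismDemo {
    public static void main(String[] args) throws Exception {
        // Setup A: one job with parallelism 10. Flink splits every operator
        // into 10 subtasks and wires them together inside a single job graph.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(10);

        env.fromSequence(0, 999)   // stand-in source for the Kafka consumer
           .map(x -> x * 2)        // stand-in processing step
           .print();

        env.execute("one-job-parallelism-10");

        // Setup B (not shown): the same pipeline left at parallelism 1 and
        // submitted 10 times, e.g. repeating `flink run -p 1 job.jar`.
        // That yields 10 independent job graphs that share no state and
        // perform no shuffles between each other.
    }
}
```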


Solution

  • The parallelism level determines whether the stream between two operators is forwarded one-to-one or is redistributing (see the sketch after this list):

    • One-to-one streams (for example, between the Source and the map() operators) preserve the partitioning and ordering of the elements. That means that subtask[1] of the map() operator will see the same elements, in the same order, as they were produced by subtask[1] of the Source operator.
    • Redistributing streams (such as between map() and keyBy/window, as well as between keyBy/window and Sink) change the partitioning of streams. Each operator subtask sends data to different target subtasks, depending on the selected transformation. Examples are keyBy() (which re-partitions by hashing the key), broadcast(), and rebalance() (which re-partitions randomly). In a redistributing exchange the ordering among the elements is only preserved within each pair of sending and receiving subtasks (for example, subtask[1] of map() and subtask[2] of keyBy/window). So in this example the ordering within each key is preserved, but the parallelism does introduce non-determinism regarding the order in which the aggregated results for different keys arrive at the sink.
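The sketch below illustrates both edge types under an assumed parallelism of 4: the source-to-map edge is one-to-one (no network shuffle), while keyBy introduces a redistributing, hash-partitioned exchange. The pipeline is an arbitrary counting example, not the asker's actual job:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamPartitioningDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        env.fromSequence(0, 99)
           // Source -> map is a one-to-one (forwarding) edge: subtask n of
           // map sees exactly what subtask n of the source emits, in order.
           .map(x -> Tuple2.of(x % 10, 1L))
           .returns(Types.TUPLE(Types.LONG, Types.LONG))
           // keyBy introduces a redistributing edge: each map subtask hashes
           // the key and ships the record to the matching downstream subtask,
           // so ordering is preserved per key only, not globally.
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("partitioning-demo");
    }
}
```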