apache-flink, flink-streaming

Data serialization internals between Flink FORWARD-connected operators


As I understand it, operators connected by a FORWARD data connection can be chained. If they are chained, records are handed directly from one operator to the next, with only a defensive copy in between (by default) and no serialization or network transfer.
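To make the defensive-copy behavior concrete, here is a minimal sketch (a hypothetical job, not the one from the screenshot) showing the switch that controls it. By default Flink deep-copies each record through its type serializer when passing it along a chain; `enableObjectReuse()` turns that off and passes the same object instance by reference:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingCopySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Default: each record crossing a chained operator boundary is
        // deep-copied via its serializer (the "defensive copy").
        // Enabling object reuse passes records by reference instead --
        // faster, but mutating a record downstream then becomes unsafe.
        env.getConfig().enableObjectReuse();

        env.fromElements(1, 2, 3)
           .map(x -> x * 2)        // chained with the source: same task, no serialization
           .returns(Types.INT)     // help type extraction for the lambda
           .print();

        env.execute("chaining-copy-sketch");
    }
}
```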

Given this execution graph (image: job graph of the Fraud Detection Flink job):

  • Are all the operators in the Transaction Source sharing data by reference/defensive copies?
  • What happens between Transaction Source and Dynamic Partitioning? Are they chained because it's a FORWARD data connection (and it's just visually separated because of the broadcast)?

Solution

  • Chained operators are always shown as one task, so all of the operators separated by -> inside your Transaction Source are chained.

    The Dynamic Partitioning operator is not chained because it has more than one input. While Flink has some support for chaining operators with multiple inputs, it is hard to use and not recommended in the DataStream API. The Table API/SQL, however, uses exactly that mechanism to automatically merge everything into one task. [1]
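    As a hedged illustration of why the broadcast join ends up as its own task (the stream contents and names here are hypothetical stand-ins for the fraud-detection job), `connect()` on a broadcast stream produces a two-input operator, which the DataStream API will not chain with either of its inputs:

    ```java
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.datastream.BroadcastStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
    import org.apache.flink.util.Collector;

    public class BroadcastChainSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Stand-ins for the transaction and rule sources in the screenshot.
            DataStream<String> transactions = env.fromElements("tx-1", "tx-2");
            DataStream<String> rules = env.fromElements("rule-a");

            MapStateDescriptor<String, String> rulesDesc =
                    new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);
            BroadcastStream<String> broadcastRules = rules.broadcast(rulesDesc);

            // connect() yields a two-input operator: Flink will not chain it
            // with either input, so it appears as a separate task in the job graph.
            transactions
                    .connect(broadcastRules)
                    .process(new BroadcastProcessFunction<String, String, String>() {
                        @Override
                        public void processElement(String tx, ReadOnlyContext ctx,
                                                   Collector<String> out) {
                            out.collect(tx);
                        }

                        @Override
                        public void processBroadcastElement(String rule, Context ctx,
                                                            Collector<String> out) throws Exception {
                            ctx.getBroadcastState(rulesDesc).put(rule, rule);
                        }
                    })
                    .print();

            env.execute("broadcast-chain-sketch");
        }
    }
    ```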

    For non-chained FORWARD channels, data is sent over a local channel: records are serialized and written into network buffers, but since producer and consumer run in the same TaskManager process, the data never hits the network interface.
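    You can also create such a serialized local channel deliberately. A minimal sketch (hypothetical job) using `disableChaining()`, which puts the operator in its own task so its FORWARD edges go through serialized buffers instead of in-chain method calls:

    ```java
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class LocalChannelSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements("a", "b", "c")
               .map(String::toUpperCase)
               .returns(Types.STRING)
               // This map now runs as its own task: records to and from it are
               // serialized into buffers and shipped over local channels,
               // without ever touching the network interface.
               .disableChaining()
               .print();

            env.execute("local-channel-sketch");
        }
    }
    ```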

    [1] https://developpaper.com/flink-sql-performance-optimization-multiple-input/