Search code examples
apache-flink

What is operator in Flink? How operator state and keyed state are different?


Per my understanding, examples of operators in Flink are Source operator, Transformation operator, etc. Is my understanding correct with respect to operators in Flink?

In operator state, does Flink maintain the state of each operator like (map(), reduce(), etc for each job/task) or it maintains the state of one complete job/task? Also, if my job is submitted with more than one parallelism, will each slot has its own state?

Suppose, I have submitted two jobs which are keyed stream and both jobs have the same key say "color" but both jobs are totally different. Does Flink will maintain two different states or it is going to maintain one state for both jobs.


Solution

  • Whether operator state or keyed state, Flink state is always local: each operator instance has its own state. There is no sharing or visibility across JVMs or across jobs.

    As for how the two kinds of state differ: operator state is always on-heap, never in RocksDB. Operator state has limited type options -- ListState and BroadcastState -- and it cannot be ValueState or MapState, which are the most commonly used forms of keyed state. This stems from the different way in which it is distributed and rescaled.

    A StreamSource is an example of an operator, ProcessOperator is another (a ProcessOperator wraps around a user-supplied ProcessFunction). Transformations are not operators, their role is to apply operators to streams. For example, a OneInputTransformation applies a OneInputStreamOperator to an input.

    If you want to understand operators better, I recommend this talk by Addison Higham from Flink Forward SF 2019: Becoming a Smooth Operator: A look at low-level Flink APIs and what they enable.

    If you want to understand the internals of Flink, reading Stream Processing with Apache Flink by Hueske and Kalavri is really the best and only way to go.