Search code examples
apache-flinkflink-streamingfault-tolerancecheckpoint

How to understand checkpoint in Flink correctly


I know that Flink uses checkpoint mechanism to guarantee Exactly-once. But I want to know more details.

If I'm right, each Operator has its own checkpoint. I can not understand how these checkpoints work together.

Saying that I have two source tasks A and B, and one operator C. A and B are the inputs of C.

It seems that C must wait for both of the checkpoint of A and the checkpoint of B. But how do we decide the interval of them? If the operator of C does this: output = a1 + a2 + a3 - b1, does it mean that we should set the interval of the checkpoint of B three times of the checkpoint of A?

In a word, my question is if we should do some design for the interval of checkpoint of each operator according to its job and the frequency of its inputs to avoid long-duration-waiting-for-checkpoint issue?


Solution

  • I am not sure if I follow your question. You set the checkpoint interval for the whole job not on a per operator basis. This determines the interval on which checkpoint barriers will be injected into the stream at sources. Then it traverses through the same channel as regular events. Upon receiving a checkpoint barrier a single operator checkpoints its state corresponding to that particular checkpoint (each checkpoint barrier contains checkpoint id). This way the whole job can take a consistent snapshot of all operators at that point in the stream.

    If you want a more thorough explanation how it exactly works have a look here: Data Streaming Fault Tolerance