Search code examples
apache-flinkflink-streaming

What is the restoring mechanism of the Flink on K8S when rollingUpdate is executed for update strategy?


I am wondering that the restoring procedure of checkpoint or savepoint in Flink when job is restarted by rolling updates on k8s.

Let me explain simple example as below. Assume that I have 4 pods in my flink k8s job and have following simple dataflow using parallelism 1.

source -> filter -> map -> sink

Each pod is responsible for each operator and data is consumed through the source function. Since I don't want to lose my data so I set up my dataflow as at least or exactly at once mode in Flink. And then when rolling update occurs, each pod gets restarted in a sequential way. Suppose that filter is managed by pod1, map is pod2, sink is pod3 and source is pod4 respectively. When the pod1 (filter) is restarted according to the rolling update, does the records in the source task (other task) is saved to the external place for checkpoint immediately? So it can be restored perfectly without data loss even after restarting?

And also, I am wondering that the data in map task (pod3) keep persistent to the external source when rolling update happens even though the checkpoint is not finished? It means that when the rolling update is happen, the flink is now processing the data records and the checkpoint is not completed. In this case, the current processed data in the task is loss?

I need more clarification for data restoring when we use checkpoint and k8s on flink updated by rolling strategy.


Solution

  • Flink doesn't support rolling upgrades. If one of your pods where a Flink application is currently running becomes unavailable , the Flink application will usually restart.

    The answer from David at Is the whole job restarted if one task fails explains this in more detail.

    I would also recommend to look at the current documentation for Task Failure Recovery at https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/state/task_failure_recovery/ and the checkpointing/savepointing documentation that's also listed there.