Search code examples
apache-flink

Flink Failure Recovery: what if JobManager or TaskManager failed


I'm reading the Flink official doc about Task Failure Recovery: https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html

As my understanding, this doc tells us that if some task failed for some reason, Flink is able to recover it with the help of Checkpoint mechanism.

So now I have two more questions:

  1. What if a TaskManager failed? As my understanding, a task is assigned to one or more slots, and slots are located at one or more TaskManagers. After reading the doc above, I've known that Flink can recover a failed task, but if a TaskManager failed, what would happen? Can Flink recover it too? If a failed TaskManager can be recoverd, will the tasks assigned to it can continue running automatically after it's recovered?

  2. What if the JobManager failed? If the JobManager failed, do all of TaskManagers will fail too? If so, when I recover the JobManager with the help of Zookeeper, do all of TaskManagers and their tasks will continue running automatically?


Solution

  • In a purely standalone cluster, if a Task Manager dies, then if you had a standby task manager running, it will be used. Otherwise the Job Manager will wait for a new Task Manager to magically appear. Making that happen is up to you. On the other hand, if you are using YARN, Mesos, or Kubernetes, the cluster management framework will take care of making sure there are enough TMs.

    As for Job Manager failures, in a standalone cluster you should run standby Job Managers, and configure Zookeeper to do leader election. With YARN, Mesos, and Kubernetes, you can let the cluster framework handle restarting the Job Manager, or run standbys, as you prefer, but in either case you will still need Zookeeper to provide HA storage for the Job Manager's metadata.

    Task Managers can survive a Job Manager failure/recovery situation. The jobs don't have to be restarted.

    https://ci.apache.org/projects/flink/flink-docs-stable/ops/jobmanager_high_availability.html.