apache-flink, flink-streaming

Flink could not recover from checkpoint after the JobManager was restarted


I am running a Flink cluster on Docker Compose with 1 JobManager and 1 TaskManager. I tested the checkpoint mechanism by restarting the JobManager container, but I found that the state was not restored properly. On the other hand, when I restarted the TaskManager container, recovery worked perfectly. Is this by design? And how can I recover the job from a checkpoint when the JobManager is restarted?


Solution

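  • Yes, this is by design. Flink requires that you configure High Availability (HA) in order to recover from JobManager failures. When only the TaskManager restarts, the JobManager is still running and redeploys the job from the latest checkpoint. When the JobManager itself restarts without HA, the information about the running job and its latest completed checkpoint is lost, so nothing is restored. The details of how to set up HA depend on how your cluster is deployed: you can either set up ZooKeeper to manage it, or rely on Kubernetes. See https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/ha/overview/ for details.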
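For a Docker Compose setup like the one in the question, the ZooKeeper option might look roughly like the following Flink configuration fragment. This is a minimal sketch, not a verified setup: the hostname `zookeeper`, the `/flink-ha` and `/flink-checkpoints` paths, and the cluster id are assumptions for illustration. It assumes a ZooKeeper container reachable under that hostname and a volume mounted at those paths in both the JobManager and TaskManager containers, so that the data survives a container restart. Recent Flink versions use the key `high-availability.type`; older versions use `high-availability` instead, so check the documentation for your version.

```yaml
# Enable ZooKeeper-based HA (on older Flink versions the key is `high-availability`)
high-availability.type: zookeeper

# ZooKeeper quorum; `zookeeper:2181` is an assumed service name and port
high-availability.zookeeper.quorum: zookeeper:2181

# Root znode and cluster id under which Flink keeps its HA metadata (assumed values)
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /docker-compose-cluster

# Durable storage for JobManager metadata; must survive a JobManager restart
# (assumed path on a shared volume)
high-availability.storageDir: file:///flink-ha/

# Checkpoints must also be written to storage that both containers can reach
# (assumed path on a shared volume)
state.checkpoints.dir: file:///flink-checkpoints/
```

With the official Flink Docker images, settings like these are typically passed to the containers through the `FLINK_PROPERTIES` environment variable in the Compose file rather than by editing the configuration file inside the image.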