I am running a Flink cluster on Docker Compose with 1 JobManager and 1 TaskManager. I tested the checkpoint mechanism by restarting the JobManager container, but found that the state was not restored properly. On the other hand, when I restarted the TaskManager container, recovery worked perfectly. Is this behavior by design? And how can I recover the job from a checkpoint when the JobManager is restarted?
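Yes, this is by design. Flink requires that you configure High Availability (HA) in order to recover from JobManager failures. Without HA, a restarted JobManager has no record of the jobs that were running or of their latest checkpoints, whereas a restarted TaskManager simply reconnects to the JobManager, which then restores the job from the last checkpoint. The details of how to set up HA depend on how your cluster is deployed: you can either set up ZooKeeper to manage this, or rely on Kubernetes. See https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/ha/overview/ for details.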
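
For a Docker Compose setup like the one in the question, the ZooKeeper option might look roughly like the sketch below. Treat it as a starting point rather than a verified configuration: the image tags, the `flink-data` volume, and the `/flink-data` paths are assumptions made for illustration, and the option name `high-availability.type` applies to recent Flink releases (older releases use `high-availability` instead). The HA storage directory and the checkpoint directory need to be on storage that survives a container restart and is reachable from both containers, which is why a shared volume is used here. Depending on the image, you may also need to adjust the volume's permissions so that the Flink process can write to it.

```yaml
# Sketch only: image tags, volume name, and paths are assumptions.
services:
  zookeeper:
    image: zookeeper:3.8

  jobmanager:
    image: flink:1.17
    command: jobmanager
    depends_on:
      - zookeeper
    ports:
      - "8081:8081"
    volumes:
      # Shared volume holding HA metadata and checkpoints.
      - flink-data:/flink-data
    environment:
      FLINK_PROPERTIES: |
        jobmanager.rpc.address: jobmanager
        high-availability.type: zookeeper
        high-availability.zookeeper.quorum: zookeeper:2181
        high-availability.storageDir: file:///flink-data/ha
        state.checkpoints.dir: file:///flink-data/checkpoints

  taskmanager:
    image: flink:1.17
    command: taskmanager
    depends_on:
      - jobmanager
    volumes:
      - flink-data:/flink-data
    environment:
      FLINK_PROPERTIES: |
        jobmanager.rpc.address: jobmanager
        high-availability.type: zookeeper
        high-availability.zookeeper.quorum: zookeeper:2181
        high-availability.storageDir: file:///flink-data/ha
        state.checkpoints.dir: file:///flink-data/checkpoints

volumes:
  flink-data:
```

With a setup along these lines, the JobManager stores job metadata and pointers to completed checkpoints in the HA storage directory and ZooKeeper, so a restarted JobManager container should be able to pick the job up again from its latest checkpoint. Checkpointing itself still has to be enabled for the job.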