apache-flink

JobManager is not recovering ZooKeeper checkpoints


We deployed a Flink job cluster (1 JobManager and 1 TaskManager) in our K8s environment and configured it in HA mode (connected to ZooKeeper). The job is stateful, and checkpointing is enabled with the RocksDB backend. The problem is that TaskManager restarts recover properly from the last checkpoint, but JobManager restarts do not:

[flink-akka.actor.default-dispatcher-5]recover: 2018-11-27 11:23:26,531 INFO  o.a.f.r.c.ZooKeeperCompletedCheckpointStore Recovering checkpoints from ZooKeeper.
[flink-akka.actor.default-dispatcher-5]recover: 2018-11-27 11:23:26,596 INFO  o.a.f.r.c.ZooKeeperCompletedCheckpointStore Found 0 checkpoints in ZooKeeper.
[flink-akka.actor.default-dispatcher-5]recover: 2018-11-27 11:23:26,597 INFO  o.a.f.r.c.ZooKeeperCompletedCheckpointStore Trying to fetch 0 checkpoints from storage.

The checkpoint data is persisted to Google Cloud Storage, and the checkpoint handles are tracked in ZooKeeper.

The relevant properties in flink-conf.yaml:

metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
high-availability: zookeeper
high-availability.zookeeper.quorum: our-k8s-zookeeper-service:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /service_cluster
high-availability.storageDir: gs://our-flink-bucket/namespace/service/ha
high-availability.jobmanager.port: 6123
state.backend.fs.memory-threshold: 0
state.checkpoints.dir: gs://our-flink-bucket/namespace/service/checkpoints

What are we missing here?
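One way to narrow this down is to inspect the checkpoint znodes directly with ZooKeeper's CLI. A sketch, assuming the quorum address and HA paths from the `flink-conf.yaml` above (the exact znode layout below the cluster-id varies by Flink version):

```shell
# Connect to the ZooKeeper quorum that Flink's HA services use
./zkCli.sh -server our-k8s-zookeeper-service:2181

# Inside the CLI: list checkpoint handles under the HA root + cluster-id.
# If this node is empty right after a JobManager restart, the handles
# were never written (or were cleaned up) rather than lost on recovery.
ls /flink/service_cluster/checkpoints
```

If the znodes exist but recovery still reports "Found 0 checkpoints", the problem is on the recovery path rather than the write path.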


Solution

  • We finally found the problem: it turned out to be a bug in Flink 1.6.1 (this one).

    Upgrading to 1.6.2 resolved it.
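Since the fix is version-specific, it helps to pin the patched release explicitly in the cluster deployment rather than relying on a floating tag. An illustrative Kubernetes fragment (container name and image tag are assumptions, not taken from the original setup):

```yaml
# Illustrative deployment fragment: pin the patched Flink release.
spec:
  containers:
    - name: jobmanager
      image: flink:1.6.2   # 1.6.1 had the ZooKeeper checkpoint recovery bug
```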