
Apache Flink - How do checkpoints/savepoints work if we run duplicate jobs (multi-tenancy)?


I have multiple Kafka topics (multi-tenancy), and I run the same job multiple times based on the number of topics, with each job consuming messages from one topic. I have configured the file system as the state backend.
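For context, here is a minimal sketch of the setup described above, assuming one job instance per tenant topic. The class name, topic, paths, and Kafka settings are placeholders, and the source shown is the classic FlinkKafkaConsumer connector:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class TenantJob {
    public static void main(String[] args) throws Exception {
        final String topic = args[0]; // e.g. "tenant-a-events" -- one job per topic

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot state every 60s; snapshots go to the file-system state backend.
        env.enableCheckpointing(60_000);
        env.setStateBackend(new FsStateBackend("file:///flink/checkpoints"));

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "consumer-" + topic);

        env.addSource(new FlinkKafkaConsumer<>(topic, new SimpleStringSchema(), props))
           .uid("kafka-source-" + topic) // stable UID matters for savepoint restores later
           .print();

        // The name passed here is the job name you give at submission time.
        env.execute("consumer-" + topic);
    }
}
```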

Assume there are 3 jobs running. How do checkpoints work here? Do all 3 jobs store their checkpoint information in the same path? If any of the jobs fails, how does it know where to recover its checkpoint information from? We give a job name while submitting a job to the Flink cluster; does that name have anything to do with it? In general, how does Flink differentiate jobs and their checkpoint information when restoring after failures or manual restarts (irrespective of whether the jobs are the same or different)?

Case 1: What happens in case of job failure?

Case 2: What happens if we manually restart the job?

Thank you


Solution

  • The JobManager is aware of each job's checkpoints and keeps that metadata. Checkpoints are saved to the checkpoint directory configured in flink-conf.yaml (state.checkpoints.dir); under this directory, Flink creates a randomly hashed directory per job (the job ID), with one subdirectory per checkpoint (see the config sketch after this list).

    Case 1: The job will restart (depending on your restart strategy...), and if checkpointing is enabled it will recover from the last completed checkpoint (a restart-strategy sketch follows this list).

    Case 2: I'm not 100% sure, but I think if you cancel the job manually and then submit it again, it won't read the checkpoint. You'll need to use a savepoint: you can cancel your job with a savepoint, and then submit your job again restoring from that savepoint (see the CLI sketch below). Just be sure that every operator has a UID. You can read more about savepoints here: https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html
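To make the directory structure from the first point concrete, here is a sketch of the relevant flink-conf.yaml entries and the on-disk layout they produce; all paths and IDs below are made up:

```yaml
# flink-conf.yaml
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints

# Resulting layout: one subdirectory per job, named after the 32-character
# job ID (this is the "randomly hashed" directory), and one chk-<n>
# directory per retained checkpoint:
#
#   /flink/checkpoints/
#     7684be6f1c7b2a9f4d3e5c6b8a9d0e1f/   <- job consuming topic A
#       chk-42/_metadata
#     0a1b2c3d4e5f60718293a4b5c6d7e8f9/   <- job consuming topic B
#       chk-42/_metadata
```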
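For Case 1, "the job will restart" is governed by the configured restart strategy; a minimal sketch, where the retry count and delay are arbitrary example values:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartConfig {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // On failure, retry up to 3 times, waiting 10s between attempts.
        // Each retry restores operator state from the latest completed checkpoint.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
    }
}
```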
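For Case 2, the cancel-with-savepoint-then-resubmit workflow looks roughly like this on the Flink CLI; the job ID, paths, and jar name are placeholders. (By default a manually cancelled job's checkpoints are cleaned up, which is why the savepoint route is the safe one.)

```sh
# Take a savepoint and cancel the job in one step (find the job ID with `flink list`):
flink cancel -s file:///flink/savepoints 7684be6f1c7b2a9f4d3e5c6b8a9d0e1f
# ... the CLI prints the savepoint path, e.g.
#     file:///flink/savepoints/savepoint-7684be-a1b2c3d4e5f6

# Resubmit the same jar, restoring state from that savepoint:
flink run -s file:///flink/savepoints/savepoint-7684be-a1b2c3d4e5f6 tenant-job.jar tenant-a-events
```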