
Automatic job resumption on Flink cluster restart


I am running jobs on a standalone Flink cluster with a single JobManager, running in a Docker container. Whenever the cluster crashes and restarts, I have to resubmit the jobs manually before they start again. Is there a way to make Flink resume the jobs automatically once the cluster is running again?


Solution

  • If a job fails because it throws an exception, the JobManager will restart it automatically as long as (1) you have checkpointing enabled (it's disabled by default because it requires some configuration), and (2) you haven't set a restart strategy that prevents restarts (the default restart strategy is fine). If a TaskManager in a standalone cluster fails completely, you'll need to start a replacement yourself.

    To configure JobManager failover, which is what covers the case where the JobManager process itself goes down and comes back, see the docs on high availability for standalone clusters.
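
    As a rough sketch of how the pieces above might fit together in flink-conf.yaml, assuming a ZooKeeper-based HA setup: the hostnames, paths, and values below are placeholders, and several of these keys have been renamed across Flink releases (for example, newer versions use `high-availability.type` and `restart-strategy.type`), so check the configuration reference for the version you run. With HA metadata stored durably outside the container, a restarted JobManager should be able to recover the jobs it was running instead of waiting for a manual resubmit.

    ```yaml
    # Checkpointing: needed for automatic restarts from the last good state.
    execution.checkpointing.interval: 10s
    # Placeholder path; must be durable storage reachable by all nodes.
    state.checkpoints.dir: hdfs:///flink/checkpoints

    # Restart strategy: optional, the default is fine once checkpointing is on.
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 3
    restart-strategy.fixed-delay.delay: 10 s

    # JobManager high availability (ZooKeeper variant).
    high-availability: zookeeper
    # Placeholder ZooKeeper address.
    high-availability.zookeeper.quorum: zookeeper:2181
    # Placeholder path; must survive a container restart.
    high-availability.storageDir: hdfs:///flink/ha/
    high-availability.cluster-id: /my-cluster
    ```

    In a Docker setup, the important part is that the checkpoint and HA storage directories live outside the container's own filesystem (a mounted volume or a remote filesystem); otherwise the recovery metadata is lost together with the container.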