docker, apache-flink, amazon-ecs

Flink resume from externalised checkpoint question


I am running Flink inside ECS, installed from the docker-flink image. I have enabled externalized checkpoints to AWS S3 by setting state.checkpoints.dir in flink-conf.yaml.
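For reference, the relevant flink-conf.yaml entries look roughly like this (the bucket and path are placeholders; the state.backend line is an assumption about my setup):

```yaml
# flink-conf.yaml -- bucket/path below are placeholders
state.backend: filesystem
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
```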

Now, according to the Flink documentation here, if we want to resume from a checkpoint after a failure, we have to run bin/flink run -s :checkpointMetaDataPath [:runArgs]. However, I start my job with FLINK_HOME/bin/standalone-job.sh start-foreground, so I cannot figure out how my Flink job would resume from an externalized checkpoint in case of failure.

Do we really need a config option for resuming from a checkpoint? Can't the JobManager, as part of its restart strategy, automatically read the last offsets from the state store? I am new to Flink.


Solution

  • The referenced Dockerfile alone won't start a Flink job. It only starts a Flink session cluster, which is able to execute Flink jobs. The next step is to submit a job with bin/flink run. Once a job that has checkpointing enabled via StreamExecutionEnvironment.enableCheckpointing is submitted and running, it will write checkpoints to the configured location.
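    In the job code this looks roughly as follows — a sketch, not a complete class; the 60-second interval is an arbitrary example, and enableExternalizedCheckpoints/RETAIN_ON_CANCELLATION are the retention API on Flink versions contemporary with docker-flink:

    ```java
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Take a checkpoint every 60 seconds (interval is an arbitrary example)
    env.enableCheckpointing(60_000);

    // Retain the last checkpoint when the job is cancelled, so it can
    // later be passed to bin/flink run -s
    env.getCheckpointConfig().enableExternalizedCheckpoints(
            CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    ```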

    If you have checkpoint retention enabled, then you can cancel the job and resume it from a checkpoint via bin/flink run -s ....
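    A cancel-and-resume cycle might then look like this — the job id, bucket, and jar name are placeholders, and chk-42 stands in for whichever chk-<n> directory holds the retained checkpoint:

    ```shell
    # Cancel the running job; with retention enabled, its last checkpoint stays in S3
    bin/flink cancel <jobId>

    # Resume from the retained checkpoint's metadata directory
    bin/flink run -s s3://my-bucket/flink/checkpoints/<jobId>/chk-42 my-job.jar
    ```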

    Job cluster

    In case you are running a per-job cluster, where the image already contains the user-code jars, you can resume from a savepoint by starting the image with --fromSavepoint <SAVEPOINT_PATH> as a command line argument. Note that <SAVEPOINT_PATH> needs to be accessible from the container running the job manager.
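    Assuming a per-job image whose entrypoint forwards arguments to the job (the image name and path below are placeholders), this could look like:

    ```shell
    docker run my-flink-job-image \
        --fromSavepoint s3://my-bucket/flink/checkpoints/<jobId>/chk-42
    ```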

    Update

    In order to resume from a checkpoint when using standalone-job.sh you have to call

    FLINK_HOME/bin/standalone-job.sh start-foreground --fromSavepoint <SAVEPOINT/CHECKPOINT_PATH>