I have a Flink job deployed on a local kind cluster that saves checkpoints to AWS S3.
The following error kept appearing in the JobManager log during the initial startup phase:
2023-07-07 19:33:48,657 INFO org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger checkpoint for job 1ff112ff1bdf5c91c6e88f4112ecaf25 since Checkpoint triggering task Source: App Event Source (1/1) of job **** is not being executed at the moment. Aborting checkpoint. Failure reason: Not all required tasks are currently running..
but the error stopped, and checkpointing started working normally after these two log lines:
"2023-07-07 18:23:03,819 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: App Event Source (1/1) (****) switched from INITIALIZING to RUNNING."
"2023-07-07 18:23:04,719 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Audit filter -> App Transform -> App Event Sink: Writer -> App Event Sink: Committer (1/1) (*****) switched from INITIALIZING to RUNNING."
Is there a way to fix this?
I would presume this is a timing issue between job startup and the checkpointing interval. The job appears to be attempting to take a checkpoint before this specific source operator has fully initialized (i.e., switched from INITIALIZING to RUNNING), so the checkpoint is aborted.
You could try increasing the checkpoint interval to see if that makes a noticeable difference. However, it's worth noting that Flink will retry a failed checkpoint (up to the tolerable-failures threshold), and it's common for an occasional checkpoint to fail and then succeed on the subsequent attempt, so these early aborts during startup are usually harmless.
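As a sketch, the relevant knobs can be set in `flink-conf.yaml` (the values below are illustrative, not recommendations — tune them to your job's startup time and checkpoint duration):

```yaml
# flink-conf.yaml — illustrative checkpointing settings
# Trigger a checkpoint at most once per interval:
execution.checkpointing.interval: 60s
# Minimum pause between the end of one checkpoint and the start of the next:
execution.checkpointing.min-pause: 10s
# Number of consecutive checkpoint failures tolerated before the job fails:
execution.checkpointing.tolerable-failed-checkpoints: 5
```

A larger interval or a nonzero tolerable-failed-checkpoints value gives the job room to finish initializing before a failed trigger can affect it; the same settings are also available programmatically through `StreamExecutionEnvironment`'s `CheckpointConfig`.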