
Kubernetes jobs and back-off limit values: is the value a number of retries or minutes?


I was reading the Kubernetes documentation about jobs and retries. I found this:

There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s …) capped at six minutes. The back-off count is reset if no new failed Pods appear before the Job’s next status check.

I had two questions about the above quote:

  1. Is the back-off limit value a number of minutes or a number of retries? The documentation's use of the value 6 (six) is confusing, because it first states that the value is the number of retries but then says "capped at six minutes".
  2. Is there a way to define the back-off delay time? As I understand it, this behavior (10s, 20s, 40s …) is the default and can't be changed.

Solution

  • There is no ambiguity: .spec.backoffLimit is the number of retries. The "six minutes" in the quote is the cap on the delay between retries, not the limit itself.

    The Job controller recreates the failed Pods (associated with the Job) with an exponentially increasing delay (10s, 20s, 40s, ... , 360s). This delay is set by the Job controller rather than by a field in the Job spec.

    • If the Pod fails, a new Pod is created after 10s.
    • If it fails again, a new one is created after 20s.
    • If it fails again, a new one is created after 40s.
    • If it fails again, the next one is created after 80s (1m 20s).
    • If it fails again, the next one is created after 160s (2m 40s).
    • If it fails again, a new Pod is created after 320s (5m 20s).
    • If it fails again, the next one is created after 360s (not 640s, because that exceeds the 360s, or 6m, cap).
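
The sequence above can be reproduced with a small sketch. This is only an illustration of the arithmetic (doubling from 10s, capped at 360s), not the Job controller's actual code; the function name `backoff_delays` and its parameters are hypothetical and chosen for this example.

```python
def backoff_delays(failures, base_seconds=10, cap_seconds=360):
    """Return the delay (in seconds) before each recreated Pod.

    Hypothetical helper: the delay doubles after every failure,
    starting at base_seconds, and never exceeds cap_seconds.
    """
    return [min(base_seconds * 2 ** i, cap_seconds) for i in range(failures)]


# Delays before the 1st through 7th recreated Pod.
delays = backoff_delays(7)
print(delays)  # [10, 20, 40, 80, 160, 320, 360]
```

Note that the number passed in here plays the role of a retry count, like .spec.backoffLimit, while the 360s cap is a separate limit on the delay between retries.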