python redis celery

Celery: what is the reason to set acks_late=True without also setting task_reject_on_worker_lost=True?


After experimenting with some failure scenarios in Celery (with Redis as the broker, for whatever it's worth), we concluded that there is effectively no point in setting acks_late=True without also setting task_reject_on_worker_lost=True: in our tests the task is not rescheduled and stays in the "unacked" state forever.

At the same time, everybody says that acks_late makes the task eligible for rescheduling on the same or another worker. So the question is: when does that actually happen?

The official docs say:

Note that the worker will acknowledge the message if the child process executing the task is terminated (either by the task calling sys.exit(), or by signal) even when acks_late is enabled. This behavior is intentional as…

  • We don’t want to rerun tasks that forces the kernel to send a SIGSEGV (segmentation fault) or similar signals to the process.

  • We assume that a system administrator deliberately killing the task does not want it to automatically restart.

  • A task that allocates too much memory is in danger of triggering the kernel OOM killer, the same may happen again.

  • A task that always fails when redelivered may cause a high-frequency message loop taking down the system.

If you really want a task to be redelivered in these scenarios you should consider enabling the task_reject_on_worker_lost setting.
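To make the cases in the quoted note easier to compare, here is a toy model of what happens to the broker message under each combination of settings. It does not use Celery at all: `Event` and `message_fate` are hypothetical names invented for this sketch, and the outcomes only encode my reading of the documented behavior (with the default of acknowledging on failure or timeout), not the worker's actual code.

```python
from enum import Enum


class Event(Enum):
    """What happened while the task was running (hypothetical categories)."""
    COMPLETED = "task returned or raised an ordinary exception"
    CHILD_LOST = "child process exited or was killed by a signal"
    HOST_LOST = "whole worker or machine disappeared (reboot, power loss)"


def message_fate(acks_late: bool, reject_on_worker_lost: bool, event: Event) -> str:
    """Return a rough description of what happens to the broker message."""
    if not acks_late:
        # Early ack: the message is acknowledged before the task body runs,
        # so nothing that happens afterwards can cause a redelivery.
        return "acked before execution"
    if event is Event.COMPLETED:
        # Late ack: acknowledged once the task finishes
        # (assumes the default of acking on failure or timeout).
        return "acked after execution"
    if event is Event.CHILD_LOST:
        # The main worker process survives and notices the lost child.
        if reject_on_worker_lost:
            return "rejected and requeued"
        return "acked after child loss"
    # HOST_LOST: nobody is left to ack or reject the message.
    return "left unacked; redelivery is up to the broker"


if __name__ == "__main__":
    for event in Event:
        for reject in (False, True):
            fate = message_fate(True, reject, event)
            print(f"acks_late=True reject_on_worker_lost={reject} {event.name}: {fate}")
```

In this model, the only row where acks_late=True alone changes the outcome is the last one, where the main worker process itself is gone and cannot acknowledge anything.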

What are some examples of "something went wrong" that don't fall into the "worker was terminated deliberately or killed by a signal" category?


Solution

  • Reboot, power outage, or hardware failure. N.B.: all of your examples assume that the prefetch multiplier is 1.
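For reference, a minimal sketch of a celeryconfig.py-style settings module that combines the options discussed here. The setting names are Celery's documented ones, but the broker URL and the timeout value are assumptions for illustration. As far as I understand the Redis transport, an unacknowledged message is only redelivered once the visibility timeout expires (one hour by default), which may be why a task can look stuck in "unacked" during a short test.

```python
# celeryconfig.py (illustrative values, not a recommendation)

# Assumed local Redis broker; replace with your own URL.
broker_url = "redis://localhost:6379/0"

# Acknowledge the message after the task finishes rather than before it starts.
task_acks_late = True

# Requeue the message if the child process running the task is lost.
task_reject_on_worker_lost = True

# Reserve only one message per worker process, so a dying worker
# holds back as few unacknowledged tasks as possible.
worker_prefetch_multiplier = 1

# Redis transport: seconds before an unacked message is redelivered.
# 3600 mirrors the documented default; it should exceed your longest task.
broker_transport_options = {"visibility_timeout": 3600}
```

Such a module would typically be loaded with app.config_from_object("celeryconfig"), assuming the file is importable under that name.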