python django celery amazon-sqs

Celery infinite retry pattern issue


I am using Celery with AWS SQS as the broker for async tasks. The task is declared like this:

@app.task(
    autoretry_for=(Exception,),
    max_retries=5,
    retry_backoff=True,
    retry_jitter=False,
    acks_late=True,
)
@onfailure_reject(non_traced_exceptions=NON_TRACED_EXCEPTIONS)
def send_order_update_event_task(order_id, data):
    .........

However, the retry pattern breaks down when I use an integer value for the retry_backoff argument: the number of tasks being spawned grows out of control.

Logs:

2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [53285c923f-79232a3856]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10
2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [1052f09663-c19b42589a]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10
2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [dd021828dd-4f6b8ae6f8]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10
2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [116bef9273-e4dbfb526b]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10
2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [913697ae7e-d4f65d45a5]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10
2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [d99e889882-a76718b549]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10
2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [d99e889882-30bac3e515]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10
2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [d7f01e5b4f-edfa22355f]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10
2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [8ba15966ae-2266247e56]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10
2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [738688f34d-34067ca58b]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10
2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [c790586783-b363d38520]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10
2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [6231986f4c-7696b7cf47]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10
2024-12-10 05:16:10 ERROR [1b810665-c0b1-4527-8cd9-c142f67d6605] [e020ded4ca-f11c933d87]  tasks.order_request_task -  [ send_order_update_event_task] Exception for order: 700711926: Order absent 700711926, retry_count: 10

I am logging the retry count on each retry, but many task executions share the same retry count: for example, there are 20 executions with retry count 1, 40 with retry count 2, and so on. I am not sure why this is happening.

A single queue (celery-requests-primary) is used for these tasks, and they all run in one deployment, also called celery-requests-primary, which has multiple pods. What might be causing this? Is any other information needed to debug it?


Solution

  • This is related to the visibility_timeout configuration of the SQS queue. According to the Celery documentation, if a task isn't acknowledged within the visibility_timeout, the message is redelivered to another worker and executed again. This causes problems for retried tasks whose retry delay (ETA/countdown) exceeds the visibility timeout: the message is redelivered and executed again and again in a loop, and each redelivered copy can schedule its own retries, which would explain why the number of executions grows with every retry count. So the visibility timeout has to be increased to cover the longest ETA (up to retry exhaustion) we plan to use.
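
As a rough illustration of how to size the timeout, the sketch below estimates the retry delays and derives a visibility_timeout from the longest one. It assumes Celery's documented exponential backoff without jitter (delay = factor * 2^retries, capped by retry_backoff_max, which defaults to 600 seconds). The helper name backoff_delays and the 300-second safety margin are hypothetical choices for this example, not part of Celery.

```python
def backoff_delays(factor, max_retries, backoff_max=600):
    """Estimate the countdown (in seconds) before each retry.

    Assumption: mirrors Celery's exponential backoff with
    retry_jitter=False, i.e. min(backoff_max, factor * 2 ** retries),
    where retries starts at 0. retry_backoff=True corresponds to factor=1.
    """
    return [min(backoff_max, factor * 2 ** n) for n in range(max_retries)]


# Settings from the question: retry_backoff=True, max_retries=5.
delays = backoff_delays(factor=1, max_retries=5)

# An integer backoff (60 here, purely as an example value) grows much
# faster and hits the retry_backoff_max cap.
delays_int = backoff_delays(factor=60, max_retries=5)

# Hypothetical margin to cover the task's own run time.
SAFETY_MARGIN = 300
visibility_timeout = max(delays_int) + SAFETY_MARGIN

# In a real project this dict would be assigned to
# app.conf.broker_transport_options (value is in seconds).
broker_transport_options = {"visibility_timeout": visibility_timeout}

print(delays)
print(delays_int)
print(broker_transport_options)
```

If the SQS queue was created outside Celery, its own visibility timeout attribute may also need to be raised to the same value; that depends on how the queue is provisioned and is worth checking in the AWS console.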