Azure Batch tasks stuck in Running state

I have several tasks on Azure Batch that are stuck in the Running state, although the compute node knows nothing about them (they are not running there and no task folders exist). Any task operation in the GUI (Terminate, Delete, Show files on node) ends with: There was an error while terminating task t20171129-0010-03. The server returned '500 Internal Server Error'. This has happened several times on different pools, jobs, and tasks.

I have now checked the debug logs on the node itself. The issue seems to start with "failed to extend lease", after which the agent deletes the task from the node and then logs "aborting attempt to update task table without an active queue lease".

Is this something I can avoid, or is it a bug in the Azure Batch service? What exactly is the "lease", and how often does it need to be renewed? (My Azure subscription does not include Technical Support.)

Interesting lines from log:

agent.task.lease■lease.py■_renew_lease_unsafe_async■106■1398■MainThread■139690855581440■extending lease for pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06
requests.packages.urllib3.connectionpool■connectionpool.py■_make_request■387■1398■Thread-1■139690661328640■"PUT /pd1batch-a-fa357c64-5c3d-4db8-9366-680943d2c20d/messages/821bf60d-3ba5-43a1-9c3d-c7500758bfea?sv=2015-07-08&se=2017-12-06T00%3A42%3A17Z&sp=up&sig=XXX&visibilitytimeout=360&popreceipt=AwAAAAMAAAAAAAAAFePc%2BR5u0wEBAAAA HTTP/1.1" 404 221
azurestorage.helper.HTTPNotFoundError: 404 Client Error: The specified message does not exist. for url: https://watbl2prod1.queue.core.windows.net/pd1batch-a-fa357c64-5c3d-4db8-9366-680943d2c20d/messages/821bf60d-3ba5-43a1-9c3d-c7500758bfea?sv=2015-07-08&se=2017-12-06T00%3A42%3A17Z&sp=up&sig=mU9501N4HHuDeRWuA7qMNni9M%2Fbb83OWLF8AW0%2B4nQE%3D&visibilitytimeout=360&popreceipt=AwAAAAMAAAAAAAAAFePc%2BR5u0wEBAAAA
agent.task.lease■lease.py■_renew_lease_unsafe_async■119■1398■MainThread■139690855581440■failed to extend lease for pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06
agent.task.manager■manager.py■handle_task_lease_extension_error_async■4713■1398■MainThread■139690855581440■deleting task pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06$0 because lease was lost
agent.task.manager■manager.py■_postprocess_execute_task_async■2255■1398■MainThread■139690855581440■updating row in task table for: pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06$0
agent.task.manager■manager.py■_update_tasktable_entity_async■1624■1398■MainThread■139690855581440■aborting attempt to update task table without an active queue lease for pd1batch 22F55DC6E98C8653$1a-python 22F4F1C234F19066$job-1$t20171129-0010-06$0
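For anyone who wants to scan a saved copy of the agent log for the same pattern, here is a minimal sketch. It assumes the fields are separated by "■" in the order logger, source file, function, line number, process ID, thread name, thread ID, message, which is only inferred from the lines above and is not a documented format. The names AgentLogLine, parse_line and lease_failures are made up for this example.

```python
from typing import Iterable, List, NamedTuple, Optional

SEP = "\u25a0"  # the "■" character that separates fields in the lines above
LEASE_FAILURE_PREFIX = "failed to extend lease for "


class AgentLogLine(NamedTuple):
    # Field order is an assumption inferred from the sample log lines.
    logger: str
    source_file: str
    function: str
    line_no: int
    pid: int
    thread_name: str
    thread_id: int
    message: str


def parse_line(raw: str) -> Optional[AgentLogLine]:
    """Parse one log line; return None for lines that do not match the layout
    (for example traceback lines, which carry no separators)."""
    parts = raw.rstrip("\n").split(SEP, 7)
    if len(parts) != 8:
        return None
    try:
        return AgentLogLine(
            parts[0], parts[1], parts[2], int(parts[3]),
            int(parts[4]), parts[5], int(parts[6]), parts[7],
        )
    except ValueError:
        # A numeric field did not contain a number; treat the line as unparsed.
        return None


def lease_failures(lines: Iterable[str]) -> List[str]:
    """Return the task identifiers for which a lease extension failed."""
    failed = []
    for raw in lines:
        entry = parse_line(raw)
        if entry and entry.message.startswith(LEASE_FAILURE_PREFIX):
            failed.append(entry.message[len(LEASE_FAILURE_PREFIX):])
    return failed
```

Calling lease_failures(open("agent-debug.log", encoding="utf-8")) on a saved log would list the affected tasks; the file name here is hypothetical.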

Entire log: https://pastebin.com/fkqTRuBe

Solution

Azure Batch tasks currently have a maximum total lifetime of 7 days, measured from the time the task is submitted to the job, as noted in the Azure Batch documentation.
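To check how close a task is to that limit, the arithmetic can be sketched as follows. This is only an illustration: the 7-day constant is the value stated in this answer and may change, the helper names are made up, and the submission time is something you would read from the task's properties yourself.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Limit as stated in this answer; treat it as an assumption that may change.
MAX_TASK_LIFETIME = timedelta(days=7)


def lifetime_deadline(submitted_at: datetime) -> datetime:
    """Return the moment at which the task reaches its maximum lifetime."""
    return submitted_at + MAX_TASK_LIFETIME


def time_remaining(submitted_at: datetime, now: Optional[datetime] = None) -> timedelta:
    """Return how much lifetime is left; negative if the limit has passed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return lifetime_deadline(submitted_at) - now


if __name__ == "__main__":
    # Hypothetical submission time, used only for illustration.
    submitted = datetime(2017, 11, 29, 0, 10, tzinfo=timezone.utc)
    print("deadline:", lifetime_deadline(submitted).isoformat())
    print("remaining:", time_remaining(submitted))
```

Use timezone-aware datetimes throughout, since mixing naive and aware values raises a TypeError on subtraction.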

When this limit is reached, an issue in the service prevents the task state update from propagating, so the task keeps showing as Running. However, if you check the state of the node where the task ran, you will see that it has returned to idle (assuming no other tasks are scheduled to it or currently running on it).

You have a few options to avoid this situation. If your workload is amenable to scaling up, you can migrate to a more performant VM type so that each task completes within the time limit. If you can scale out your problem (or scale it out further), either by distributing the computation or by chunking the problem into smaller pieces and running them in an embarrassingly parallel fashion, that may also resolve your issue.
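As a rough illustration of the chunking approach, the sketch below splits a list of inputs into fixed-size groups and builds one task description per group. Everything here is hypothetical: the script name process.py, the "chunk" ID prefix, and the dictionary layout are placeholders, and you would still need to turn each entry into a real task using whichever Batch client you use.

```python
import shlex
from typing import Dict, List, Sequence


def chunk(items: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most chunk_size elements."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [list(items[i:i + chunk_size]) for i in range(0, len(items), chunk_size)]


def build_task_specs(inputs: Sequence[str], chunk_size: int,
                     id_prefix: str = "chunk") -> List[Dict[str, str]]:
    """Build one task description (ID plus command line) per chunk of inputs."""
    specs = []
    for index, group in enumerate(chunk(inputs, chunk_size)):
        # process.py is a placeholder for your own per-chunk worker script.
        arguments = " ".join(shlex.quote(name) for name in group)
        specs.append({
            "id": f"{id_prefix}-{index:04d}",
            "command_line": f"python process.py {arguments}",
        })
    return specs


if __name__ == "__main__":
    for spec in build_task_specs(["a.csv", "b.csv", "c.csv", "d.csv", "e.csv"], 2):
        print(spec["id"], "->", spec["command_line"])
```

Pick the chunk size so that the slowest chunk finishes well inside the lifetime limit, leaving room for time spent waiting in the queue.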

The current behavior is not very user friendly. There are plans to increase this limit in the future.