Retry deleting pool or jobs using Azure Batch?

I am using this Microsoft tutorial as a starting point for working with Azure Batch pools, jobs, and containers.

I have altered their code for deleting pools and jobs slightly, to:

// Cleanup Batch Account Resources
// Clean up Job
await batchClient.JobOperations.DeleteJobAsync($"{BatchConstants.JobIdPrefix}-{Guid}");
            
// Clean up Pool
await batchClient.PoolOperations.DeletePoolAsync($"{BatchConstants.PoolIdPrefix}-{Guid}");

This works great when I run the code locally, but when it is deployed to my development environment it runs into an issue when deleting the pool or the job (usually the job). I get back the status code "ServiceUnavailable".

When I manually log in to the Azure portal, I can see the containers were deleted without issue (so I know the connection can be made and can successfully delete Azure objects), but I notice the pool and job are still alive.

It does not appear that JobOperations or PoolOperations have a notion of retry policies, so is there any other way I can have it retry deleting pools and/or jobs a few more times if it gets back a ServiceUnavailable status? Or should I just try it in essentially a for loop that runs up to 5 (or so) times if it gets back a bad status code or continues with the rest of the program if a good status code comes back?

Thanks for the help.

Solution

You can provide a retry policy on the batchClient itself which will apply to all operations that are retryable (i.e., the client will automatically retry the operation on your behalf if it is a retryable operation). For example, to add a linear retry policy that retries every 5 seconds for 10 attempts maximum:

batchClient.CustomBehaviors.Add(RetryPolicyProvider.LinearRetryProvider(TimeSpan.FromSeconds(5), 10));

You can use any of the existing retry policies or create your own retry policy by implementing the IRetryPolicy interface.

Typically, ServiceUnavailable is caused by a temporary outage or issue and will recover on its own. That said, you may still need to handle the case where even these retry policies fail after the maximum number of attempts. How you handle it depends on what is acceptable for your scenario. For example, you may be OK with job deletion failing for an extended period of time, but not OK with pool deletion failing for longer than some threshold. In that case you may want to add more robust retry handling, or fall back to alerting and notification in your system.
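
If you do want an explicit fallback on top of the client-level policy (essentially the bounded loop you describe in the question), here is a minimal sketch of one way to write it. RetryHelper and TryWithRetriesAsync are hypothetical names I made up for illustration, not part of the Azure Batch SDK. The helper only relies on System and System.Threading.Tasks, and you supply the rule for what counts as a transient failure.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of the Azure Batch SDK.
public static class RetryHelper
{
    // Runs `operation` up to `maxAttempts` times, waiting `delay` between attempts.
    // Returns true as soon as one attempt succeeds, and false if every attempt
    // failed with an exception that `isTransient` accepted.
    // Exceptions that `isTransient` rejects propagate to the caller immediately.
    public static async Task<bool> TryWithRetriesAsync(
        Func<Task> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts,
        TimeSpan delay)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                await operation();
                return true;
            }
            catch (Exception ex) when (isTransient(ex))
            {
                // Only wait if there is another attempt left.
                if (attempt < maxAttempts)
                {
                    await Task.Delay(delay);
                }
            }
        }

        // All attempts failed with transient errors; the caller decides what to do next.
        return false;
    }
}
```

You would pass something like () => batchClient.PoolOperations.DeletePoolAsync(poolId) as the operation, and a predicate that recognizes the failure you are seeing (I am assuming here that it surfaces as a BatchException carrying the ServiceUnavailable status code, so check what your code actually catches). A false return value is the point where you would log, raise an alert, or queue the cleanup for later instead of failing the rest of the program.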