I receive the following in the portal:
There was an error while deleting [THUMBPRINT HERE]. The server returned 500 error. Do you want to try again?
I suspect that an Azure Batch pool or node is still holding on to the certificate. However, the pools and nodes that used that certificate have already been deleted (at least, they are no longer visible in the portal).
Is there a way to force-delete the certificate? In normal operation, my release pipeline relies on being able to delete it.
Intercepting Azure PowerShell with Fiddler, I can see the following in the HTTP response, so the operation appears to be timing out:
{
  "odata.metadata": "https://ttmdpdev.northeurope.batch.azure.com/$metadata#Microsoft.Azure.Batch.Protocol.Entities.Container.errors/@Element",
  "code": "OperationTimedOut",
  "message": {
    "lang": "en-US",
    "value": "Operation could not be completed within the specified time.\nRequestId:[REQUEST ID HERE]\nTime:2017-08-23T16:54:23.1811814Z"
  }
}
I have also deleted any corresponding tasks and schedules, but still no luck.
(Disclosure: At the time of writing, I work on the Azure Batch team, though not on the core service.)
500 errors are usually transient and may represent heavy load on Batch internals (as opposed to 503s which represent heavy load on the Batch API itself). The internal timeout error reflects this. It's possible there was an unexpected spike in demand on specific APIs which are high-cost but are normally low-usage. We monitor and mitigate these, but sometimes an extremely high load with an unusual usage pattern can impact service responsiveness. I'd suggest you keep trying every 10-15 minutes, and if it doesn't clear itself in a few hours then try raising a support ticket.
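If you want to automate the "keep trying every 10-15 minutes" advice, a minimal sketch in Python might look like the following. Everything named here is an assumption for illustration: `TransientError` is a hypothetical stand-in for however your client surfaces the 500 / OperationTimedOut response, and `delete_fn` is a hypothetical callable wrapping whatever delete call you already make. It is not part of any Batch SDK.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a 500 / OperationTimedOut response."""


def delete_with_retry(delete_fn, interval_s=600, max_wait_s=3 * 3600,
                      sleep=time.sleep):
    """Call delete_fn until it succeeds or the wait budget is used up.

    delete_fn is a caller-supplied callable (an assumption, not a Batch API)
    that raises TransientError on a transient server failure.
    Returns the number of attempts made; re-raises the last TransientError
    once another wait would exceed max_wait_s.
    """
    waited = 0
    attempts = 0
    while True:
        attempts += 1
        try:
            delete_fn()
            return attempts
        except TransientError:
            # Give up (and escalate, e.g. to a support ticket) once the
            # next sleep would take us past the overall budget.
            if waited + interval_s > max_wait_s:
                raise
            sleep(interval_s)
            waited += interval_s
```

The defaults (10 minutes between attempts, 3 hours overall) simply mirror the suggestion above; tune them to whatever your pipeline can tolerate. Passing `sleep` as a parameter makes the loop easy to test without actually waiting.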
There is currently no way to force-delete the certificate. This is an internal safety mechanism to ensure that Batch is never in a position where it has to deploy a certificate of which it no longer has a copy. You could request such a feature via the Batch UserVoice.
Finally, regarding your specific scenario, you could see whether it's feasible to rejig your workflow so that it doesn't depend on certificate deletion. You could, for example, have a garbage collection tool (perhaps running on Azure Functions or Azure Scheduler) that periodically cleans out old certificates. Arguably this adds more complexity (and arguably shouldn't be necessary), but it improves resilience and in other ways simplifies the solution, as your main path no longer needs to worry so much about delays and timeouts. If you want to explore this path, then perhaps post on the Batch forums and kick off a discussion with the team about possible design approaches.
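To make the garbage-collection idea a bit more concrete, here is a rough Python sketch of a single cleanup pass. It is only an outline under stated assumptions: `list_thumbprints`, `is_stale`, and `delete_fn` are hypothetical callables that you would implement around whatever Batch client you already use, and the rule for what counts as "old" is entirely up to you.

```python
def collect_garbage(list_thumbprints, is_stale, delete_fn):
    """Run one cleanup pass over the account's certificates.

    All three arguments are caller-supplied callables (assumptions for
    this sketch, not Batch APIs):
      list_thumbprints() -> iterable of certificate thumbprints
      is_stale(thumbprint) -> True if the certificate is no longer needed
      delete_fn(thumbprint) -> deletes it, raising on failure

    Returns (deleted, deferred). Deferred certificates are simply left
    for the next scheduled pass instead of failing the whole run.
    """
    deleted, deferred = [], []
    for thumbprint in list_thumbprints():
        if not is_stale(thumbprint):
            continue
        try:
            delete_fn(thumbprint)
            deleted.append(thumbprint)
        except Exception:
            # A timeout or 500 here is not fatal: the certificate stays
            # in place and the next pass will try again.
            deferred.append(thumbprint)
    return deleted, deferred
```

The point of the design is that a failed delete no longer blocks anything: the release pipeline uploads its new certificate and moves on, while a scheduled job calls something like this and retries the leftovers on its next run.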