python azure azure-batch

Job deletion and recreation in Azure Batch raises BatchErrorException


I'm writing a task manager for Azure Batch in Python. When I run the manager and add a job to the specified Azure Batch account, I do the following:

  1. check if the specified job id already exists
  2. if yes, delete the job
  3. create the job

Unfortunately, this fails between steps 2 and 3. Even though I issue the deletion command for the job and check that no job with the same id remains in the Azure Batch account, I get a BatchErrorException like the following when I try to create the job again:

Exception encountered:
The specified job has been marked for deletion and is being garbage collected.

The code I use to delete the job is the following:

import datetime
import time

import azure.batch.models as batchmodels


def deleteJob(self, jobId):

    print("Delete job [{}]".format(jobId))

    self.__batchClient.job.delete(jobId)

    # Wait until the job is deleted
    # 10 minutes timeout for the operation to succeed
    timeout = datetime.timedelta(minutes=10)
    timeout_expiration = datetime.datetime.now() + timeout
    while True:

        try:
            # As long as we can retrieve data related to the job, it means it is still deleting
            self.__batchClient.job.get(jobId)
        except batchmodels.BatchErrorException as err:
            # Only a "JobNotFound" error code means the job is gone;
            # any other service error is re-raised instead of being
            # mistaken for a successful deletion
            if err.error.code != "JobNotFound":
                raise
            print("Job {jobId} deleted correctly.".format(jobId=jobId))
            break

        time.sleep(2)

        if datetime.datetime.now() > timeout_expiration:
            raise RuntimeError(
                "ERROR: couldn't delete job [{jobId}] within timeout period of {timeout}.".format(
                    jobId=jobId, timeout=timeout))

I tried to check the Azure SDK, but couldn't find a method that would tell me exactly when a job was completely deleted.


Solution

  • Querying for the existence of the job is the only way to determine whether it has been deleted from the system.

    Alternatively, if you do not strictly need to reuse the same job id, you can issue the delete and then create the job under a different id. This lets the old job be deleted asynchronously, off your critical path.
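
Both options can be sketched without depending on the SDK. This is only an illustration under stated assumptions: `wait_until_gone` and `unique_job_id` are hypothetical helper names, not part of azure-batch, and `job_exists` stands for any zero-argument callable you supply (for example, one that wraps the job lookup from the question and returns False once the lookup reports that the job was not found).

```python
import time
import uuid


def wait_until_gone(job_exists, timeout_s=600.0, poll_interval_s=2.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll job_exists() until it returns False or the timeout elapses.

    job_exists is a hypothetical callable provided by the caller; this
    helper knows nothing about Azure Batch itself.
    """
    deadline = clock() + timeout_s
    while job_exists():
        if clock() >= deadline:
            raise TimeoutError(
                "job still exists after {} seconds".format(timeout_s))
        sleep(poll_interval_s)


def unique_job_id(base_id):
    """Return base_id with a short random suffix so ids are not reused."""
    return "{}-{}".format(base_id, uuid.uuid4().hex[:8])


if __name__ == "__main__":
    # Stand-in for the real lookup: the job "exists" for two polls.
    answers = iter([True, True, False])
    wait_until_gone(lambda: next(answers), poll_interval_s=0.0)
    print("old job is gone; new job id:", unique_job_id("nightly-run"))
```

With the second helper, the manager can delete the old job and immediately create a new one under a fresh id, so it never has to wait for garbage collection. The first helper is only needed when the original id must be reused.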