Tags: azure, azure-batch

Azure Batch Preempted state


I have a TVM/pool running under Azure Batch, and it suddenly went into the Preempted state. The problem is that it is no longer taking any requests.

I have also set up a scale formula that gives me a VM whenever more than 0 jobs are pending execution in Azure Batch. That apparently is not working either, although it was working before the TVM went into the preempted state.

How do I deal with this situation?


Solution

    • AFAIK, your nodes are (I think) low-priority nodes, which can go into the "preempted" state depending on available capacity. For this reason, low-priority VMs are suitable only for certain types of workloads: use them for batch and asynchronous processing where the job completion time is flexible and the work is distributed across many VMs. That behavior is documented here: https://learn.microsoft.com/en-us/azure/batch/batch-low-pri-vms

    • I think the second part of your question (the scale formula no longer working) is very likely also a consequence of your VMs being preempted.
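One possible way to keep work moving while low-priority capacity is unavailable is to let the scale logic request dedicated nodes in place of preempted ones. The sketch below is only a plain-Python illustration of that decision, not an actual Batch autoscale formula: `plan_node_targets` and its parameters are hypothetical names, and in a real formula the corresponding inputs and outputs would be service variables such as `$PendingTasks`, `$PreemptedNodeCount`, `$TargetDedicatedNodes` and `$TargetLowPriorityNodes`.

```python
def plan_node_targets(pending_tasks: int, preempted_nodes: int, max_nodes: int):
    """Return (dedicated, low_priority) target node counts.

    Hypothetical helper: it mirrors the idea of replacing each preempted
    low-priority node with a dedicated one, capped at max_nodes.
    """
    # Nothing queued: scale the pool down to zero.
    if pending_tasks <= 0:
        return 0, 0

    # Ask for one node per pending task, up to the cap.
    wanted = min(max_nodes, pending_tasks)

    # Cover preempted capacity with dedicated nodes first.
    dedicated = min(wanted, preempted_nodes)

    # Fill whatever is left with low-priority nodes.
    low_priority = wanted - dedicated
    return dedicated, low_priority
```

With no preemption this requests only low-priority nodes, as the original formula presumably did; dedicated nodes appear only while some nodes are reported as preempted. Whether the extra cost of dedicated nodes is acceptable depends on how urgent the jobs are.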

    Given the characteristics of low-priority VMs, what workloads can and cannot use them? In general, batch processing workloads are a good fit, as jobs are broken into many parallel tasks or there are many jobs that are scaled out and distributed across many VMs.

    To maximize use of surplus capacity in Azure, suitable jobs can scale out.

    Occasionally VMs may not be available or are preempted, which results in reduced capacity for jobs and may lead to task interruption and reruns. Jobs must therefore be flexible in the time they can take to run.

    Jobs with longer tasks may be impacted more if interrupted. If long-running tasks implement checkpointing to save progress as they execute, then the impact of interruption is reduced. Tasks with shorter execution times tend to work best with low-priority VMs, because the impact of interruption is far less.
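As a rough illustration of the checkpointing idea, here is a minimal sketch of a task that records its progress after each item and resumes from that point when it is rerun. The names `run_with_checkpoint`, `checkpoint_path` and `process` are made up for this example, and it assumes the checkpoint file lives on storage that survives the node being preempted (for example a mounted file share), which is not something the code itself guarantees.

```python
import json
import os


def run_with_checkpoint(items, checkpoint_path, process):
    """Process items in order, resuming from the last saved position.

    Returns the number of items handled in this run.
    """
    start = 0
    # Resume from a previous run if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    for i in range(start, len(items)):
        process(items[i])
        # Write to a temp file, then rename, so a preemption in the middle
        # of a write does not leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_index": i + 1}, f)
        os.replace(tmp_path, checkpoint_path)

    return len(items) - start
```

If the task is interrupted partway through, the rerun skips the items that already completed, so only the item in flight at the time of preemption is repeated.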

    Long-running MPI jobs that utilize multiple VMs are not well suited to use low-priority VMs, because one preempted VM can lead to the whole job having to run again.

    Hope it helps.