azure-batch

Batch nodes restarting and batch pool settings


I have a couple of questions about Azure Batch pools:

  1. I noticed that sometimes while running a job, especially one with a large number of tasks (for example, 10000), some or many of the compute nodes in my pools shut down by themselves and restart. What can cause the nodes in a Batch pool to shut down and restart during execution?

  2. Is it possible to change pool configuration parameters other than size/scale after the pool has been created? For instance, I would like to change the SKU of the VMs or the number of tasks per node. If so, can it be done through the Azure portal, or does it have to be done programmatically?

Thanks!


Solution

    1. This would need to be investigated by the Azure Batch team. You can raise a support ticket in the portal, specifying your account name, region, pool ID, and job ID, along with some approximate times when the restarts happened. It will also help if you keep the VM active.
    2. You can update any of the pool properties specified in this document; note that some updates require a reboot of the compute nodes to take effect. Unfortunately, the two parameters you mentioned (VM size and max tasks per node) cannot be patched after the pool is created. You will need to either recreate the pool with the new parameters or, if you need to drain existing jobs with no downtime, create a new pool and migrate any jobs/tasks targeting the existing pool to it: disable the existing jobs with the disable option of your choice (for example, requeue), patch each job to target the new pool, and then re-enable the job.
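
The disable / patch / re-enable sequence could look roughly like the sketch below. It is a minimal sketch, not a tested migration tool. It assumes the `azure-batch` Python package (the older `BatchServiceClient` interface, with `job.disable`, `job.get`, `job.patch`, `job.enable`, `models.JobPatchParameter` and `models.PoolInformation`) and an already authenticated client. The function name `migrate_job_to_pool`, its parameters, and the `build_patch` hook are hypothetical names introduced here for illustration. Disabling a job is asynchronous, so the sketch polls until the job reports the `disabled` state before changing its pool.

```python
import time


def migrate_job_to_pool(batch_client, job_id, new_pool_id,
                        disable_option="requeue",
                        poll_seconds=5.0, timeout_seconds=600.0,
                        build_patch=None):
    """Retarget an existing job at another pool (illustrative helper).

    batch_client is assumed to be an authenticated
    azure.batch.BatchServiceClient (or an object with the same `job` API).
    """
    if build_patch is None:
        # Imported lazily so this module loads without azure-batch installed.
        from azure.batch import models as batch_models

        def build_patch(pool_id):
            return batch_models.JobPatchParameter(
                pool_info=batch_models.PoolInformation(pool_id=pool_id))

    # 1. Disable the job; disable_option is "requeue", "terminate" or "wait".
    batch_client.job.disable(job_id, disable_option)

    # 2. Wait for the job to leave the "disabling" state.
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = batch_client.job.get(job_id).state
        # The SDK returns an enum; fall back to the raw value for plain strings.
        if getattr(state, "value", state) == "disabled":
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"job {job_id!r} did not reach 'disabled'")
        time.sleep(poll_seconds)

    # 3. Point the job at the new pool.
    batch_client.job.patch(job_id, build_patch(new_pool_id))

    # 4. Re-enable the job so its tasks are scheduled on the new pool.
    batch_client.job.enable(job_id)
```

You would call it once per job that targets the old pool, for example `migrate_job_to_pool(client, "my-job", "my-new-pool")`, where both IDs are placeholders. The new pool has to exist before the job is patched, and the old pool can be deleted once no jobs reference it.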