azure-batch

How to structure an elastic Azure Batch application?


I am evaluating Batch for a project and, while it seems like it will do what I'm looking for, I'm not sure whether my assumptions about it are correct.

I have what is basically a job runner fed from a queue. The current solution works, but when the pool of nodes scales down, it just blindly kills off machines. I am looking for something that, when scaling down, will allow currently-running jobs to complete and then remove the node(s) from the pool. I also want to preemptively increase the pool size if a spike is likely to occur (and not have those nodes shut down). I can adjust the pool size externally if that makes sense (it seems like the best option so far).

My current idea is to have one pool with one job, and one task per node; each task polls a queue in a loop for messages and processes them. After an iteration count and/or time limit, it shuts down, removing that node from the pool. If the pool size didn't change, I would like to replace that node with a new one. If the pool was shrunk, it should just go away. If the pool size increases, new nodes should start up and run the task.

I'm not planning on running something that continually adds pools, nodes to the pool, or tasks to a job, though I will probably have something that sets the pool size periodically based on queue length or something similar. What I would rather not do is something like "there are 10 things in the queue, so add a pool with x nodes, then delete it".

Is this possible or are my expectations incorrect? So far, from reading the docs, it seems like it should be doable, and I have a simple task working, but I'm not sure about the scaling mechanics or exactly how to structure the tasks/jobs/pools.


Solution

  • Here's one possible way to lean into the strengths of Azure Batch and achieve what you've described.

    Create your job with a JobManagerTask that monitors your queue for incoming work and adds a new Batch task for each item of your workload. Each task will process a single piece of work, then exit (see the job-submission and job-manager sketches at the end of this answer).

    The Batch Scheduler will then take care of allocating tasks to compute nodes. It can also take care of retrying tasks that fail (the job-manager sketch at the end of this answer sets a per-task retry limit), and so on.

    Configure your pool with an AutoScale formula to dynamically resize your pool to meet your load. Your formula can set $NodeDeallocationOption to taskcompletion to ensure running tasks get to complete before any compute node is removed (see the pending-tasks autoscale sketch at the end of this answer).

    If your workload peaks are predictable (say, 9am every day), your AutoScale expression could scale up your pool in anticipation. If those spikes are not predictable, your external monitoring (or your JobManager) can change the AutoScale expression at any time to suit (see the time-based formula sketch at the end of this answer).

    If appropriate, your job manager can terminate once all the required tasks have been added; set onAllTasksComplete to terminatejob to ensure your job completes once all of your tasks have finished (shown in the job-settings sketch at the end of this answer).

    A single pool can process tasks from multiple jobs, so if you have multiple concurrent workloads, they can share the same pool. You can give jobs different priority values if you want certain jobs to be processed first (also shown in the job-settings sketch).
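
To make the JobManagerTask idea concrete, here is a rough job-submission sketch using the Python azure-batch SDK (the question doesn't name a language, so Python is only an assumption; the account name/key/URL, the pool ID "worker-pool", the job ID "queue-processing-job" and the job_manager.py script are all placeholders):

```python
# Sketch: submit a job whose JobManagerTask watches the work queue.
# All names, URLs and keys below are placeholders.
from azure.batch import BatchServiceClient
from azure.batch import models as batchmodels
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials("mybatchaccount", "<batch-account-key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.<region>.batch.azure.com")

job = batchmodels.JobAddParameter(
    id="queue-processing-job",
    pool_info=batchmodels.PoolInformation(pool_id="worker-pool"),
    job_manager_task=batchmodels.JobManagerTask(
        id="job-manager",
        # The queue-watching script (next sketch); assumed to be delivered to
        # the node via resource files or an application package.
        command_line="python job_manager.py",
        kill_job_on_completion=False,  # don't kill the job when the manager exits
    ),
)
batch_client.job.add(job)
```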
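The job manager itself could look something like the following job-manager sketch. It assumes an Azure Storage queue named "work-items", a hypothetical process_item.py that does the per-item work (and is already present on the nodes), and credentials passed in however you prefer; it drains the queue, adds one Batch task per message, and leaves placement and retries to the scheduler:

```python
# job_manager.py (illustrative): runs on a compute node as the JobManagerTask,
# adds one Batch task per queue message, then exits.
import os
from azure.batch import BatchServiceClient
from azure.batch import models as batchmodels
from azure.batch.batch_auth import SharedKeyCredentials
from azure.storage.queue import QueueClient

# Credentials inline for brevity; in practice pass them to the task via
# environment settings or the Batch authentication token.
credentials = SharedKeyCredentials("mybatchaccount", "<batch-account-key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.<region>.batch.azure.com")

job_id = os.environ["AZ_BATCH_JOB_ID"]  # set by Batch on the compute node
queue = QueueClient.from_connection_string(
    "<storage-connection-string>", "work-items")

for message in queue.receive_messages():
    task = batchmodels.TaskAddParameter(
        id=f"task-{message.id}",
        command_line="python process_item.py",
        # Hand the work item to the task via an environment variable
        # (one simple choice; resource files would also work).
        environment_settings=[
            batchmodels.EnvironmentSetting(name="WORK_ITEM", value=message.content)],
        # Let the scheduler retry a failed task up to 3 times.
        constraints=batchmodels.TaskConstraints(max_task_retry_count=3),
    )
    batch_client.task.add(job_id=job_id, task=task)
    queue.delete_message(message)
```

Each process_item.py invocation handles a single piece of work and exits, which is what lets the autoscale formula below shrink the pool safely.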
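For the autoscale side, a pending-tasks formula along these lines (adapted from the patterns in the Azure Batch autoscale documentation; the 20-node cap, sampling window and evaluation interval are arbitrary choices) sizes the pool from the number of tasks waiting to run and uses taskcompletion so a node is only removed once its running tasks have finished:

```python
# Sketch: attach or update an autoscale formula on an existing pool.
from datetime import timedelta
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

batch_client = BatchServiceClient(
    SharedKeyCredentials("mybatchaccount", "<batch-account-key>"),
    batch_url="https://mybatchaccount.<region>.batch.azure.com")

formula = """
// Use the last sample if data is sparse, otherwise average the last 15 minutes.
$samples = $PendingTasks.GetSamplePercent(TimeInterval_Minute * 15);
$tasks = $samples < 70 ? max(0, $PendingTasks.GetSample(1)) :
         max($PendingTasks.GetSample(1), avg($PendingTasks.GetSample(TimeInterval_Minute * 15)));
// One node per pending task, capped at 20 nodes.
$TargetDedicatedNodes = min(20, $tasks);
// Only remove a node after its running tasks have completed.
$NodeDeallocationOption = taskcompletion;
"""

batch_client.pool.enable_auto_scale(
    pool_id="worker-pool",
    auto_scale_formula=formula,
    auto_scale_evaluation_interval=timedelta(minutes=5),
)
```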
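For the predictable 9am spike, a time-based formula sketch (again modeled on the documented autoscale examples; the hours and node counts are made up) can pre-warm the pool, and because the same enable_auto_scale call can be repeated later, your external monitor or the job manager can swap formulas whenever it needs to:

```python
# Sketch: pre-scale for a known weekday-morning peak; repeating this call with
# a different formula changes the scaling policy in place.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

batch_client = BatchServiceClient(
    SharedKeyCredentials("mybatchaccount", "<batch-account-key>"),
    batch_url="https://mybatchaccount.<region>.batch.azure.com")

peak_formula = """
$curTime = time();
$workHours = $curTime.hour >= 8 && $curTime.hour < 18;
$isWeekday = $curTime.weekday >= 1 && $curTime.weekday <= 5;
// 20 nodes during weekday working hours, a small baseline otherwise.
$TargetDedicatedNodes = ($workHours && $isWeekday) ? 20 : 2;
$NodeDeallocationOption = taskcompletion;
"""

batch_client.pool.enable_auto_scale(
    pool_id="worker-pool", auto_scale_formula=peak_formula)
```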
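Finally, the job-settings sketch covers the last two points: the job manager can patch the job to terminatejob once it has added its last task, and a second job with a higher priority can target the same pool (method and parameter names are from the Python SDK; the IDs and priority values are illustrative):

```python
# Sketch: job-level settings for completion and for sharing the pool.
from azure.batch import BatchServiceClient
from azure.batch import models as batchmodels
from azure.batch.batch_auth import SharedKeyCredentials

batch_client = BatchServiceClient(
    SharedKeyCredentials("mybatchaccount", "<batch-account-key>"),
    batch_url="https://mybatchaccount.<region>.batch.azure.com")

# After the last task has been added, let the job terminate itself once every
# task has finished (onAllTasksComplete = terminatejob).
batch_client.job.patch(
    job_id="queue-processing-job",
    job_patch_parameter=batchmodels.JobPatchParameter(
        on_all_tasks_complete=batchmodels.OnAllTasksComplete.terminate_job),
)

# A second, more urgent job can share "worker-pool"; its tasks are scheduled
# ahead of lower-priority jobs as nodes free up (priority range -1000..1000).
urgent_job = batchmodels.JobAddParameter(
    id="urgent-reports",
    priority=100,
    pool_info=batchmodels.PoolInformation(pool_id="worker-pool"),
)
batch_client.job.add(urgent_job)
```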