I'm trying to run a 100-node AWS Batch job. When I set my compute environment to use only m4.xlarge and m5.xlarge instances, everything works fine and my job is picked up and runs.
However, when I include other instance types in my compute environment, such as m5.2xlarge, the job is stuck in the RUNNABLE state indefinitely. The only thing I am changing in these updates is the set of instance types in the compute environment.
I'm not sure what is causing this job to not be picked up when I include other instance types in the computing environment. In the documentation for Compute Environment Parameters the only note is:
When you create a compute environment, the instance types that you select for the compute environment must share the same architecture. For example, you can't mix x86 and ARM instances in the same compute environment.
The JobDefinition is multi-node:
My compute environment's max vCPUs is set to 10,000, and it is always in a VALID state and always ENABLED. My EC2 vCPU limit is 6,000. CloudWatch provides no logs because the job has not started, and I'm not sure what else to try here. I am also not using the optimal setting for instance types because I ran into issues with not getting enough instances.
I just resolved this issue. The problem is with the BEST_FIT strategy in Batch: the resource requirements of the jobs I'm submitting are not close enough to the instance type, so the jobs never get picked up.
I figured this out by modifying the job definition to use 8 vCPUs and 30GB of memory, after which the job started on the m5.2xlarge instances.
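To make the sizing point concrete, here is a rough sketch that compares a job's per-node request with the published sizes of the instance types mentioned above (m4.xlarge and m5.xlarge: 4 vCPUs, 16 GiB; m5.2xlarge: 8 vCPUs, 32 GiB). This is only an illustration of "how closely does the job match the instance", not Batch's actual placement logic. The helper names (`InstanceSpec`, `fits`, `report`) are made up for this example, and it assumes the 30GB request can be treated as 30 GiB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    name: str
    vcpus: int
    memory_gib: float


# Published sizes for the instance types discussed in this thread.
INSTANCE_SPECS = [
    InstanceSpec("m4.xlarge", 4, 16),
    InstanceSpec("m5.xlarge", 4, 16),
    InstanceSpec("m5.2xlarge", 8, 32),
]


def fits(job_vcpus, job_memory_gib, spec):
    """True if one node of the job fits on a single instance of this type."""
    return job_vcpus <= spec.vcpus and job_memory_gib <= spec.memory_gib


def report(job_vcpus, job_memory_gib):
    """List (instance name, vCPU fraction used, memory fraction used)
    for every instance type the job fits on.

    Illustrative only: this is NOT how Batch decides placement.
    """
    rows = []
    for spec in INSTANCE_SPECS:
        if fits(job_vcpus, job_memory_gib, spec):
            rows.append(
                (spec.name, job_vcpus / spec.vcpus, job_memory_gib / spec.memory_gib)
            )
    return rows


if __name__ == "__main__":
    # The modified job definition from the post: 8 vCPUs, 30GB of memory.
    for name, cpu_frac, mem_frac in report(8, 30):
        print(f"{name}: {cpu_frac:.0%} of vCPUs, {mem_frac:.0%} of memory")
```

With the 8 vCPU / 30GB request, only m5.2xlarge qualifies and the job uses nearly the whole instance, which is consistent with the job starting once the definition was changed.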
I'm going to see if using the BEST_FIT_PROGRESSIVE strategy will resolve this issue and report back, although I doubt it will.
--
Update: I have spoken with AWS Support and got some more insight. The BEST_FIT_PROGRESSIVE allocation strategy has built-in protections against over-scaling so that customers do not accidentally launch thousands of instances. A side effect of these protections is what I am experiencing: jobs failing to start.
The support engineer's recommendation was to use a single instance type in the compute environment together with the BEST_FIT allocation strategy. Since my jobs have different instance requirements, I created three separate compute environments targeting different instance types (c5.large, c5.xlarge, m4.xlarge), submitted jobs, and had them run in the appropriate compute environment.
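For anyone wanting to script this setup, below is a minimal sketch that builds one request payload per instance type, using the parameter names accepted by boto3's `create_compute_environment`. The environment naming scheme, subnet, security group, and role values are placeholders I made up for illustration; they are not from my actual setup. The sketch only builds dictionaries and makes no AWS calls.

```python
# One compute environment per instance type, each using BEST_FIT.
INSTANCE_TYPES = ["c5.large", "c5.xlarge", "m4.xlarge"]


def compute_environment_params(instance_type, *, max_vcpus, subnets,
                               security_group_ids, instance_role, service_role):
    """Build keyword arguments for batch.create_compute_environment.

    The name is derived from the instance type (hypothetical scheme);
    dots are replaced because environment names cannot contain them.
    """
    return {
        "computeEnvironmentName": "batch-" + instance_type.replace(".", "-"),
        "type": "MANAGED",
        "state": "ENABLED",
        "serviceRole": service_role,
        "computeResources": {
            "type": "EC2",
            "allocationStrategy": "BEST_FIT",
            "minvCpus": 0,
            "maxvCpus": max_vcpus,
            "instanceTypes": [instance_type],  # exactly one type per environment
            "subnets": subnets,
            "securityGroupIds": security_group_ids,
            "instanceRole": instance_role,
        },
    }


# Placeholder values: replace with your own resources.
environments = [
    compute_environment_params(
        instance_type,
        max_vcpus=10000,
        subnets=["subnet-PLACEHOLDER"],
        security_group_ids=["sg-PLACEHOLDER"],
        instance_role="ecsInstanceRole-PLACEHOLDER",
        service_role="AWSBatchServiceRole-PLACEHOLDER",
    )
    for instance_type in INSTANCE_TYPES
]

# To actually create them (requires boto3 and AWS credentials):
#   import boto3
#   batch = boto3.client("batch")
#   for params in environments:
#       batch.create_compute_environment(**params)
```

Each environment still has to be attached to a job queue before jobs can reach it; how you split jobs across queues depends on your own routing needs.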