AWS Batch Job Stuck in Runnable State


I'm trying to run a 100-node AWS Batch job. When I set my compute environment to use only m4.xlarge and m5.xlarge instances, everything works fine: my job is picked up and runs.

However, when I include other instance types in my compute environment, such as m5.2xlarge, the job stays stuck in the RUNNABLE state indefinitely. The only thing I change between these updates is the list of instance types in the compute environment.

I'm not sure what is causing this job to not be picked up when I include other instance types in the computing environment. In the documentation for Compute Environment Parameters the only note is:

When you create a compute environment, the instance types that you select for the compute environment must share the same architecture. For example, you can't mix x86 and ARM instances in the same compute environment.

The JobDefinition is multi-node:

  • Node 0:
    • vCPUs: 1
    • Memory: 15360 MiB
  • Node 1:
    • vCPUs: 2
    • Memory: 15360 MiB
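
For reference, a job definition of this shape could be expressed as a boto3 request roughly like the one below. This is only a sketch: the definition name, the container image, and the mapping of "Node 0" to the main node (`0:0`) and "Node 1" to every remaining node (`1:`) are assumptions for illustration, not details taken from the actual setup.

```python
from typing import Any, Dict


def node_range(target_nodes: str, vcpus: int, memory_mib: int) -> Dict[str, Any]:
    """Build one nodeRangeProperties entry for a multi-node parallel job."""
    return {
        "targetNodes": target_nodes,
        "container": {
            "image": "example/placeholder:latest",  # hypothetical image name
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},  # MiB
            ],
        },
    }


def build_job_definition(num_nodes: int = 100) -> Dict[str, Any]:
    """Return keyword arguments for Batch's register_job_definition call."""
    return {
        "jobDefinitionName": "example-multinode-job",  # hypothetical name
        "type": "multinode",
        "nodeProperties": {
            "numNodes": num_nodes,
            "mainNode": 0,
            "nodeRangeProperties": [
                # Assumed: "Node 0" is the main node only.
                node_range("0:0", vcpus=1, memory_mib=15360),
                # Assumed: "Node 1" covers all remaining nodes.
                node_range("1:", vcpus=2, memory_mib=15360),
            ],
        },
    }


request = build_job_definition()
```

With credentials configured, the dictionary could be passed on as `boto3.client("batch").register_job_definition(**request)`; nothing above contacts AWS.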

My compute environment's max vCPUs is set to 10,000, and the environment is always in a VALID state and always ENABLED. My EC2 vCPU limit is 6,000. CloudWatch provides no logs because the job has not started, so I'm not sure what else to try here. I am also not using the optimal setting for instance types because I ran into issues with not getting enough instances.


Solution

  • I just resolved this issue. The problem is with the BEST_FIT strategy in Batch: the resource requirements of the jobs I'm submitting are not close enough to the size of the added instance types, so the jobs never get picked up.

    I figured this out by modifying the job definition to use 8 vCPUs and 30 GB of memory, after which the job started on the m5.2xlarge instances.

    I'm going to see if using the BEST_FIT_PROGRESSIVE strategy will resolve this issue and report back, although I doubt it will.

    --

    Update: I have spoken with AWS Support and got some more insight. The BEST_FIT_PROGRESSIVE allocation strategy has built-in protections against over-scaling so that customers do not accidentally launch thousands of instances. A side effect of these protections is what I am experiencing: jobs that fail to start.

    The support engineer's recommendation was to use a single instance type per Compute Environment together with the BEST_FIT allocation strategy. Since my jobs have different instance requirements, I was able to create three separate Compute Environments targeting different instance types (c5.large, c5.xlarge, m4.xlarge), submit jobs, and have them run in the appropriate Compute Environment.
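
The one-instance-type-per-environment setup described above could be scripted along these lines. This is a minimal sketch, not the exact configuration used: the environment naming scheme, the subnet and security group IDs, and the instance role are placeholders, and `serviceRole` is left out on the assumption that the account's Batch service-linked role applies.

```python
from typing import Any, Dict, List

# Instance types mentioned in the answer, one compute environment each.
INSTANCE_TYPES = ["c5.large", "c5.xlarge", "m4.xlarge"]


def compute_environment_request(
    instance_type: str,
    subnets: List[str],
    security_group_ids: List[str],
    instance_role: str,
    max_vcpus: int = 10_000,
) -> Dict[str, Any]:
    """Return keyword arguments for Batch's create_compute_environment call."""
    return {
        # Hypothetical naming scheme, e.g. "batch-c5-large".
        "computeEnvironmentName": "batch-" + instance_type.replace(".", "-"),
        "type": "MANAGED",
        "state": "ENABLED",
        "computeResources": {
            "type": "EC2",
            "allocationStrategy": "BEST_FIT",
            "minvCpus": 0,
            "maxvCpus": max_vcpus,
            "instanceTypes": [instance_type],  # exactly one type per environment
            "subnets": subnets,
            "securityGroupIds": security_group_ids,
            "instanceRole": instance_role,
        },
    }


requests = [
    compute_environment_request(
        instance_type,
        subnets=["subnet-PLACEHOLDER"],        # placeholder value
        security_group_ids=["sg-PLACEHOLDER"],  # placeholder value
        instance_role="ecsInstanceRole",        # assumed instance profile name
    )
    for instance_type in INSTANCE_TYPES
]
```

Each entry could then be sent with `boto3.client("batch").create_compute_environment(**req)`. Jobs reach a compute environment through a job queue, so each environment would also need a queue pointing at it.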