Tags: amazon-web-services, aws-fargate, aws-batch

How can I get AWS Batch to run more than 2 or 3 jobs at a time?


I'm just getting started with AWS. I have a (rather complicated) Python script which reads in some data from an S3 bucket, does some computation, and then exports some results to the same S3 bucket. I've packaged everything in a Docker container, and I'm trying to run it in parallel (say, 50 instances at a time) using AWS Batch.

I've set up a compute environment with the following parameters:
Type: MANAGED
Provisioning model: FARGATE
Maximum vCPUs: 256

I then set up a job queue using that compute environment.
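The compute environment and job queue described above can be sketched as boto3 API parameters. This is not from the original post — the environment/queue names, subnet, and security group IDs are placeholders you would replace with your own:

```python
# Parameters for a managed Fargate compute environment, matching the
# settings described above (MANAGED, FARGATE, 256 max vCPUs).
compute_env = {
    "computeEnvironmentName": "my-fargate-env",  # placeholder name
    "type": "MANAGED",
    "computeResources": {
        "type": "FARGATE",
        "maxvCpus": 256,
        "subnets": ["subnet-0123456789abcdef0"],       # placeholder
        "securityGroupIds": ["sg-0123456789abcdef0"],  # placeholder
    },
}

# Parameters for a job queue backed by that compute environment.
job_queue = {
    "jobQueueName": "my-queue",  # placeholder name
    "priority": 1,
    "computeEnvironmentOrder": [
        {"order": 1, "computeEnvironment": "my-fargate-env"}
    ],
}

# With credentials configured, these would be created via:
# import boto3
# batch = boto3.client("batch")
# batch.create_compute_environment(**compute_env)
# batch.create_job_queue(**job_queue)
```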

Next, I set up a job definition using my Docker image with the following parameters:
vCpus: 1
Memory: 6144 (MiB)
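A job definition with those settings can be sketched as follows (again hypothetical — the definition name, image URI, and execution role ARN are placeholders). Note that for Fargate, Batch expects vCPU and memory as string-valued `resourceRequirements`:

```python
# Parameters for a Fargate job definition: 1 vCPU, 6144 MiB memory,
# running the asker's Docker image.
job_definition = {
    "jobDefinitionName": "my-job-def",  # placeholder
    "type": "container",
    "platformCapabilities": ["FARGATE"],
    "containerProperties": {
        # Placeholder image URI and execution role ARN:
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
        "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},
            {"type": "MEMORY", "value": "6144"},
        ],
    },
}

# With credentials configured, register it via:
# boto3.client("batch").register_job_definition(**job_definition)
```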

Finally, I submitted a bunch of jobs using that job definition with slightly different commands and sent them to my queue.
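Submitting a batch of jobs with slightly different commands might look like this (a sketch — the queue and definition names are the placeholders from above, and the `--chunk` argument is a hypothetical way of varying the command per job):

```python
# Build 50 job submissions, each overriding the container command
# with a different argument.
jobs = []
for i in range(50):
    jobs.append({
        "jobName": f"my-job-{i}",
        "jobQueue": "my-queue",         # placeholder
        "jobDefinition": "my-job-def",  # placeholder
        "containerOverrides": {
            # Hypothetical per-job argument:
            "command": ["python", "script.py", "--chunk", str(i)],
        },
    })

# With credentials configured, submit them via:
# batch = boto3.client("batch")
# for job in jobs:
#     batch.submit_job(**job)
```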

As I submitted the first few jobs, I saw the status of the first 2 jobs go from RUNNABLE to STARTING to RUNNING. However, the rest of them just sat there in the RUNNABLE state until the first 2 were finished.

Does anyone have any idea what the bottleneck might be to running more than 2 or 3 jobs at a time? I'm aware that there are some account limitations, but I'm not sure which one might be the bottleneck.


Solution

It turns out there were three things at play here:

    1. There was a service quota of 5 public IP addresses on my account, and each container was getting its own IP address so it could communicate with the S3 bucket. I moved all my containers into a private subnet, set up a NAT gateway in a public subnet, and routed their outbound traffic through the gateway. (More details at https://aws.amazon.com/premiumsupport/knowledge-center/nat-gateway-vpc-private-subnet/)

    2. As Marcin pointed out, Fargate does scale slowly. I switched to using EC2, which scaled much more quickly but still stopped scaling at around 30 container instances.

    3. There was a service quota on my account called "EC2 Instances / Instance Limit (All Standard (A, C, D, H, I, M, R, T, Z) instances)" which was set to 32. I reached out to AWS, and they raised the limit, so I am now able to run over 100 jobs at once.
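As an alternative to contacting support, quota increases can also be requested through the Service Quotas API. A sketch, with the caveat that the quota code below is an assumption — look it up for your account with `list_service_quotas` for the `ec2` service before relying on it:

```python
# Parameters for a Service Quotas increase request on the EC2 standard
# instance limit. The quota code is assumed, not taken from the post --
# verify it against your account first.
quota_request = {
    "ServiceCode": "ec2",
    "QuotaCode": "L-1216C47A",  # assumed: Running On-Demand Standard instances
    "DesiredValue": 128.0,
}

# With credentials configured, submit the request via:
# boto3.client("service-quotas").request_service_quota_increase(**quota_request)
```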