Tags: amazon-web-services, aws-batch

AWS Batch: how to increase concurrency for Fargate jobs


I'm trying to use Batch for large-scale parallelised job execution with Docker containers. I would like to process thousands of tasks simultaneously.

I have everything up and running. My compute environment is configured with a max vCPUs of 2048. Each task is configured to use a single vCPU, and 2GB of RAM. I am using an array job with 1,000 array elements (for now).

Problem is: when I create a new job, concurrency seems to be extremely limited. When I look at the cluster in ECS, "pending tasks" constantly hovers around 50 (it never seems to go above that), and "running tasks" doesn't go far above 30. Even though each individual task only takes ~10 seconds to complete, the entire batch takes ~20 minutes.

This isn't what I expected. With the above settings, I thought Batch would process all 1,000 tasks at the same time.

I originally thought the problem might have been caused by my use of a public subnet (all Fargate containers had public IPs). I changed to use a private subnet (with NAT gateway), but it didn't help.

Does anyone know what I'm doing wrong?

Thanks!


Solution

  • Answer is in the comments above, but for posterity:

The AWS Batch Compute Environment with Fargate resources is not scaling fast enough for your needs, since each job launches in its own Fargate resource. Using EC2 for the Compute Environment will instead launch large instances that run multiple jobs concurrently, so the number of running jobs will scale up much faster.
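    A minimal sketch of what the EC2-backed compute environment could look like, expressed as the request body for the Batch `CreateComputeEnvironment` API (via boto3). The environment name, subnet, security group, and role identifiers below are placeholders, not values from the question:

    ```python
    # Sketch: request body for an EC2-backed Batch compute environment.
    # All names/ARNs are hypothetical placeholders.
    compute_environment = {
        "computeEnvironmentName": "batch-ec2-env",   # hypothetical name
        "type": "MANAGED",
        "computeResources": {
            "type": "EC2",                     # EC2 instead of FARGATE
            "allocationStrategy": "BEST_FIT_PROGRESSIVE",
            "minvCpus": 0,
            "maxvCpus": 2048,                  # same ceiling as in the question
            "instanceTypes": ["optimal"],      # let Batch pick large instances
            "subnets": ["subnet-xxxxxxxx"],            # placeholder
            "securityGroupIds": ["sg-xxxxxxxx"],       # placeholder
            "instanceRole": "ecsInstanceRole",         # placeholder instance profile
        },
    }

    # To actually create it (requires AWS credentials and real ARNs):
    # import boto3
    # boto3.client("batch").create_compute_environment(**compute_environment)
    ```

    With 1-vCPU / 2 GB jobs, a single large instance can then host many tasks at once, and Batch only has to launch a handful of instances rather than one Fargate task per job.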

As for why you are seeing that ceiling (~50 pending, ~30 running): your workload has most likely reached an equilibrium between the rate at which tasks launch and the rate at which they finish. If your jobs ran for minutes rather than ~10 seconds, you would see the total number of running tasks climb well above what you are seeing now.