amazon-web-services batch-processing amazon-ecs aws-fargate aws-batch

Understanding where to begin with batch processing on AWS

I have a set of calculations that needs to run in a batch, and the workload is easily parallelized across machines. The work is already packaged in a Docker container. I'm trying to understand the easiest way to run this workload in a highly parallel way on AWS. However, in trying to figure out where to begin, I'm having trouble finding the right entry point. I read about AWS Batch and AWS Fargate, but each time I go down one of those paths to learn more, additional AWS services start popping up (Lambda, Step Functions, ECS, Auto Scaling groups), with each article using a different combination. Furthermore, I start thinking about the problem as Batch vs. Fargate, and then I find another article that talks about Batch + Fargate, or X + ECS + ....

I'm having trouble finding the appropriate introduction to the choices so I can get started with setting something up and getting some experience. Any pointers on which direction I might go or some resources for me to look at?

Solution

AWS container services team member here. Your question pushes all my buttons because I have been working on a deliverable to address some of this confusion ("where do I start with xyz?"). I can try to answer your question briefly here, but if you want to read more (perhaps way more than you need), feel free to contact me offline (mreferre at amazon dot com will work).

First and foremost, it's not a "versus" but an "and". Think of all the products you mention as sitting at different layers of the stack (this is a draft visual from the deliverable):

Fargate represents capacity (where your container runs), ECS is the core container orchestrator, and Batch is one of the provisioners on top of the container orchestrator. Lambda is something separate that lives on its own. The options for your specific use case seem to be:

Lambda
ECS/Fargate
Batch/ECS/Fargate
Step Functions/ECS/Fargate (this one is outside my analysis and you don't see it in my visual; I'm wondering if I should add it).

As others have hinted, you probably want to use Lambda if your model is event-driven (e.g. if you want to fire up a dedicated function for every event, like a new file uploaded to S3).
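To make the event-driven model concrete, here is a minimal sketch of a Lambda handler that reacts to S3 upload notifications. It assumes the function is subscribed to the bucket's object-created events; `process_object` is a hypothetical placeholder for the actual calculation, not something from the original discussion.

```python
from urllib.parse import unquote_plus


def process_object(bucket: str, key: str) -> str:
    """Hypothetical placeholder for the real calculation on one input file."""
    return f"s3://{bucket}/{key}"


def handler(event, context):
    """Lambda entry point: invoked for each S3 notification delivered to the function."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        processed.append(process_object(bucket, key))
    return {"processed": processed}
```

Each uploaded file triggers its own invocation, so parallelism comes from the event source rather than from any scheduler you manage yourself.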

You probably do not want to use a naked ECS/Fargate solution because it would require more work to deal with the triggering and the scheduling of your batch jobs.
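For a sense of what that extra work looks like, here is a rough sketch of fanning out tasks with boto3's `ecs.run_task` directly. The cluster, task definition, container name, subnets, and the `SHARD_ID` environment variable are all illustrative assumptions, as is the choice to assign public IPs.

```python
def build_run_task_requests(cluster, task_definition, container_name, subnets, shard_ids):
    """Build one ecs.run_task keyword-argument dict per shard of work."""
    requests = []
    for shard_id in shard_ids:
        requests.append({
            "cluster": cluster,
            "taskDefinition": task_definition,
            "launchType": "FARGATE",
            "count": 1,
            "networkConfiguration": {
                # Assumption: tasks run in subnets where a public IP is wanted.
                "awsvpcConfiguration": {"subnets": list(subnets), "assignPublicIp": "ENABLED"}
            },
            "overrides": {
                "containerOverrides": [{
                    "name": container_name,
                    # SHARD_ID is a hypothetical variable the container would read.
                    "environment": [{"name": "SHARD_ID", "value": str(shard_id)}],
                }]
            },
        })
    return requests


def launch_all(requests):
    """Start the tasks. Retries, throttling, and completion tracking are not handled here."""
    import boto3  # imported lazily so the builder above works without AWS access
    ecs = boto3.client("ecs")
    return [ecs.run_task(**request) for request in requests]
```

Everything this sketch leaves out (queuing, retrying failed tasks, knowing when the whole batch is done) is what a layer such as Batch or Step Functions would otherwise handle.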

You probably want to use either Batch or Step Functions to schedule jobs on ECS/Fargate. I'd argue Step Functions (SF) is good if you have basic workflows to deal with, and Batch if you need to manage complex jobs at scale. Perhaps this 35-minute presentation that I did last year can provide a bit more background on the Batch vs. SF differences.
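As one possible starting point on the Batch side, the sketch below submits a single array job with boto3's `batch.submit_job` and shows how each child could pick its share of the work. It assumes a job queue and a job definition already exist (for Fargate, backed by a Fargate compute environment); all names and the slicing scheme are illustrative, not a prescribed setup.

```python
import os


def build_array_job(job_name, job_queue, job_definition, num_children, extra_env=None):
    """Build keyword arguments for batch.submit_job describing one array job."""
    if not 2 <= num_children <= 10000:
        raise ValueError("AWS Batch array size must be between 2 and 10,000")
    environment = [{"name": k, "value": str(v)} for k, v in sorted((extra_env or {}).items())]
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "arrayProperties": {"size": num_children},
        "containerOverrides": {"environment": environment},
    }


def submit(request):
    """Submit the array job and return its job ID (requires AWS credentials)."""
    import boto3  # imported lazily so the helpers work without AWS access
    return boto3.client("batch").submit_job(**request)["jobId"]


def items_for_child(items, num_children, environ=os.environ):
    """Inside the container: select this child's slice of the work items."""
    # Batch sets AWS_BATCH_JOB_ARRAY_INDEX (0-based) in each child of an array job.
    index = int(environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return items[index::num_children]
```

With this shape, one `submit` call fans the same container out across `num_children` copies, and Batch takes care of queuing and placing them on the underlying capacity.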

Let me know if you have any additional questions because this discussion is super useful for the positioning I am trying to build.