Search code examples
amazon-web-servicesamazon-s3amazon-ec2amazon-ecs

Running Batch Jobs on Amazon ECS


I'm very new to using AWS, and even more so for ECS. Currently, I have developed an application that can take an S3 link, download the data from that link, processes the data, and then output some information about that data. I've already packaged this application up in a docker container and now resides on the amazon container registry. What I want to do now is start up a cluster, send an S3 link to each EC2 instance running Docker, have all the container instances crunch the numbers, and return all the results back to a single node. I don't quite understand how I am supposed to change my application at this point. Do I need to make my application running in the docker container a service? Or should I just send commands to containers via ssh? Then assuming I get that far, how do I then communicate with the cluster to farm out the work for potentially hundreds of S3 links? Ideally, since my application is very compute intensive, I'd like to only run one container per EC2 instance.

Thanks!


Solution

  • Your story is hard to answer since it's a lot of questions without a lot of research done.

    My initial thought is to make it completely stateless.

    You're on the right track by making them start up and process via S3. You should expand this to use something like an SQS queue. Those SQS messages would contain an S3 link. Your application will start up, grab a message from SQS, process the link it got, and delete the message.

    The next thing is to not output to a console of any kind. Output somewhere else. Like a different SQS queue, or somewhere.

    This removes the requirement for the boxes to talk to each other. This will speed things up, make it infinitely scalable and remove the strange hackery around making them communicate.

    Also why one container per instance? 2 threads at 50% is the same as 1 at 100% usually. Remove this requirement and you can use ECS + Lambda + Cloudwatch to scale based on the number of messages. >10000, scale up, that kind of thing. <100 scale down. This means you can throw millions of messages into SQS and just let ECS scale up to process them and output somewhere else to consume.