Search code examples
concurrencyarchitecturelockingamazon-ecs

Creating a scalable and fault tolerant system using AWS ECS


We're designing C# scheduled task (runs every few hours) that will run on AWS ECS instances that will grab batched transaction data for thousands of customers from an endpoint, modify the data then send it on to another web service. We will be maintaining the state of the last successful batch in a separate database (using some like created date of the transactions). We need the system to be scalable so as more customers are added we add additional ECS containers to process the data. There are the options we're considering:

  1. Each container only processes a specific subset of the data. As more customers are added more contains are added. We would need to maintain a logical separation of what contains are processing what customers data.
  2. All the containers process all of the customers. We use some kind of locking flags on the database to let other processes know that the customers data is being processed.
  3. Some other approach.

I think that option 2 is probably the best, but it adds a lot of complexity regarding the locking and unlocking of customers. Are there specific design patterns I could be pointed towards if that if the correct solution?


Solution

  • In both scenarios an important thing to consider is retries in case processing for a specific customer fails. One potential way to distribute jobs across a vast number of container with retries would be to use AWS SQS.

    A single container would run periodically every few hours and be the job generator. It would create one SQS queued item for each customer that needs to be processed. In response to items appearing in the queue a number of "worker" containers would be spun up by ECS to consume items from the queue. This can be made to autoscale relative to the number of items in the queue to quickly spin up many containers that can work in parallel.

    Each container would use its own high performance concurrent poller similar to this (https://www.npmjs.com/package/squiss) to start grabbing items from the queue and processing them. If a worker failed or crashed due to a bug then SQS will automatically redeliver and dropped queued items that worker had been working on to a different worker after they time out.

    This approach would give you a great deal of flexibility, and would let you horizontally scale out the number of workers, while letting any of the workers process any jobs from the queue that it grabs. It would also ensure that every queued item gets processed at least once, and that none get dropped forever in case something crashes or goes wrong.