Load Control Function in AWS Step Function


An AWS Step Functions state machine has a Lambda function at its core that writes heavily to an S3 bucket. When the state machine gets a usage spike, the function starts failing because S3 throttles further requests (com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate.). This causes the state machine execution as a whole to fail, and the system takes several minutes to fully recover.

I looked into the AWS Lambda function scaling documentation and found that when we lower the function's reserved concurrency, the function starts returning 429 status codes as soon as it can't handle new events.

So my idea for load-controlling the function execution can be summarized as follows:

  1. Set the reserved concurrency to some lower value.
  2. Catch the 429 errors in the step function and retry with a backoff rate.
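Step 1 could be scripted roughly as follows. This is only a sketch: it assumes a boto3 Lambda client is passed in, and the function name `my-heavy-writer` and the limit of 10 are hypothetical placeholders, not values from my setup.

```python
def set_reserved_concurrency(lambda_client, function_name, limit):
    """Cap the number of simultaneous executions of one Lambda function.

    lambda_client is expected to be a boto3 Lambda client, e.g.
    boto3.client("lambda"). Invocations beyond the cap are throttled.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )


# Usage with a real client (requires boto3 and AWS credentials);
# the name and limit below are hypothetical:
#   import boto3
#   set_reserved_concurrency(boto3.client("lambda"), "my-heavy-writer", 10)
```

The right limit depends on how many concurrent writers the bucket tolerates, so it would probably have to be found by experiment.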

I'd like feedback on the following aspects:

a. Does my approach make sense, or am I missing some obviously better way? I first thought of managing the load with AWS SQS or some execution-wide locking/semaphore, but didn't get any further.

b. Is there maybe another way to tackle the issue from the S3 side?


Solution

  • This approach worked well for me:

    States:
     MyFunction:
      Type: Task
      End: true
      Resource: "..."
      Retry:
       - ErrorEquals:
          - TooManyRequestsException
         IntervalSeconds: 30
         MaxAttempts: 5
         BackoffRate: 2
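
To get a feel for how long this retry policy keeps a throttled task waiting, the delays can be worked out by hand. The sketch below assumes the usual Step Functions semantics (the first retry waits IntervalSeconds, each later retry multiplies the previous wait by BackoffRate, and no jitter or maximum delay is configured); the helper name `retry_delays` is made up for illustration.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Return the wait in seconds before each retry attempt."""
    return [
        interval_seconds * backoff_rate ** attempt
        for attempt in range(max_attempts)
    ]


delays = retry_delays(interval_seconds=30, max_attempts=5, backoff_rate=2)
print(delays)       # [30, 60, 120, 240, 480]
print(sum(delays))  # 930
```

So with these values a task that keeps getting throttled waits up to 930 seconds (about 15.5 minutes) in total before the state finally fails.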