An AWS Step Functions state machine has a Lambda function at its core that does heavy writes to an S3 bucket. When the state machine gets a usage spike, the function starts failing because S3 throttles further requests (com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate.). This obviously causes the state machine executions to fail as a whole, and it takes the system several minutes to fully recover.
I looked into the AWS Lambda function scaling documentation and found that when we reduce the function's reserved concurrency, it starts returning 429 status codes as soon as it can't handle new events.
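For reference, reserved concurrency can be capped in a CloudFormation template via the ReservedConcurrentExecutions property of AWS::Lambda::Function (the value 10 here is just an illustrative example, as is the function name):

```yaml
MyThrottledFunction:
  Type: AWS::Lambda::Function
  Properties:
    # ... handler, runtime, role, code as usual ...
    # Cap concurrent executions; invocations beyond this get a 429
    # (TooManyRequestsException) instead of scaling further.
    ReservedConcurrentExecutions: 10
```

The same cap can be set on an existing function with the CLI: aws lambda put-function-concurrency --function-name MyThrottledFunction --reserved-concurrent-executions 10.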
So my idea for load-controlling the function execution can be summarized as follows: reduce the reserved concurrency, catch the resulting 429 errors in the step function, and retry with a backoff rate.

I'd like feedback on the following aspects:
a. Does my approach make sense, or am I missing an obviously better way? I first thought of managing the load with AWS SQS or some execution-wide locking/semaphore, but didn't get any further with that.

b. Is there maybe another way to tackle the issue from the S3 side?
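On (b), one S3-side mitigation is to spread the writes over several key prefixes: S3's request-rate limits apply per prefix, so hashing objects into a small set of shard prefixes raises the aggregate write rate the bucket can absorb. A minimal sketch (the helper name and shard count are my own, not from the original setup):

```python
import hashlib

def sharded_key(key: str, shards: int = 16) -> str:
    """Prepend a stable hash-derived shard prefix to an S3 object key.

    Objects end up spread across `shards` prefixes (00/, 01/, ... 0f/),
    each of which gets its own S3 request-rate allowance.
    """
    # Hash the key so the shard assignment is deterministic per object.
    shard = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % shards
    return f"{shard:02x}/{key}"
```

The trade-off is that listing objects by their original key layout becomes harder, since logically adjacent keys now live under different prefixes.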
This approach worked well for me:
States:
  MyFunction:
    Type: Task
    End: true
    Resource: "..."
    Retry:
      - ErrorEquals:
          - TooManyRequestsException
        IntervalSeconds: 30
        MaxAttempts: 5
        BackoffRate: 2
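With these values, the wait before each retry grows geometrically: per the Step Functions retry semantics, attempt N waits IntervalSeconds * BackoffRate^(N-1) seconds. A quick sketch of the resulting schedule (the helper is mine, for illustration only):

```python
def retry_intervals(interval_seconds: int, backoff_rate: int, max_attempts: int) -> list:
    """Wait time before each retry attempt, mirroring Step Functions'
    IntervalSeconds / BackoffRate / MaxAttempts semantics."""
    return [interval_seconds * backoff_rate ** (attempt - 1)
            for attempt in range(1, max_attempts + 1)]

print(retry_intervals(30, 2, 5))  # → [30, 60, 120, 240, 480]
```

So the five attempts above span roughly 15.5 minutes of cumulative waiting, which gives a throttled function plenty of room to drain its backlog.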