We use a Fifo Queue configured with a Lambda Function as a processor. We use MessageGoupId and BatchSize to optimistically remove redundant messages. To rate limit processing we use reserved concurrency. Our function timeout has to be high. The queue maximum receives is set to ten.
When there are a lot of messages in the queue, the Lambda function scales up. Once it scales enough to throttle, the queue processing comes to a complete hold and no further messages are processed until some minutes have passed.
I’m assuming this is because of throttling, because halting always coincides with throttling and when increasing the reserved concurrency, it takes much longer for the processing to come to a halt.
I’m assuming the time until the queue starts up again has to do with the lambda retry limit, the function timeout and the queue visibility timeout. But since I don’t know exactly what happens this is a guess.
There are no errors logged and ultimately all messages are processed, but since processing is time and throughput sensitive, it is not acceptable to have the queue halt for many minutes.
What is happening and how can we solve this? If more information is needed I’d be very happy to debug further.
Edit: Found this: To allow your function time to process each batch of records, set the source queue's visibility timeout to at least 6 times the timeout that you configure on your function. The extra time allows for Lambda to retry if your function execution is throttled while your function is processing a previous batch.
and we are certainly in violation, but I'm not sure how / if this explains the observed behaviour.
At the end of this answer there is a complete, minimal example that is written to easily reproduce the issue.
To deploy, create all files and fill in your aws profile
and desired region
into all sh
files.
Then run
. deploy-stack.sh
to create the cloudformation stack that contains all necessary resources.
Then open the AWS web interface (SQS) and run
. generate-messages.sh
to generate messages on the queue.
One can then see that about half of the messages are processed before the function throttles and the queue comes to a complete halt.
To remove the cloudformation stack after all debugging has been completed run remove-stack.sh
AWS documentation contains a note saying
To allow your function time to process each batch of records, set the source queue's visibility timeout to at least 6 times the timeout that you configure on your function. The extra time allows for Lambda to retry if your function execution is throttled while your function is processing a previous batch.
Changing the timeout
on the lambda function from 600
to 100
and redeploying the stack allows all messages to process correctly even though the lambda function throttles.
I can not explain why this behaviour is observed and would greatly appreciate feedback on that. However the above does solve the issue described.
stack.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Debug Stack for Fifo with Lambda Processor
Resources:
MyLambdaRole:
Type: AWS::IAM::Role
Properties:
RoleName:
Fn::Sub: lambda-role
AssumeRolePolicyDocument:
Version: 2012-10-17
Statement:
- Action:
- sts:AssumeRole
Effect: Allow
Principal:
Service:
- lambda.amazonaws.com
ManagedPolicyArns:
- arn:aws:iam::aws:policy/AWSLambdaExecute
- arn:aws:iam::aws:policy/AmazonSqsFullAccess
Path: /
MySqsQueue:
Type: 'AWS::SQS::Queue'
Properties:
FifoQueue: true
VisibilityTimeout: 600
MySQSQueueFunction:
Type: AWS::Lambda::Function
Properties:
Handler: index.handler
Role: !GetAtt MyLambdaRole.Arn
Runtime: nodejs12.x
Timeout: 600
ReservedConcurrentExecutions: 5
Code:
ZipFile: |
exports.handler = (event, context) => new Promise((resolve) => {
setTimeout(resolve, 1000);
});
MySQSLambdaEventSource:
Type: AWS::Lambda::EventSourceMapping
Properties:
BatchSize: 1
Enabled: false
EventSourceArn: !GetAtt MySqsQueue.Arn
FunctionName: !Ref MySQSQueueFunction
Outputs:
QueueUrl:
Value:
Ref: MySqsQueue
EventSource:
Value:
Ref: MySQSLambdaEventSource
deploy-stack.sh
#!/bin/bash
profile=local
region=us-east-1
# -----------------
aws cloudformation deploy \
--profile $profile \
--region $region \
--template-file stack.yaml \
--stack-name fifo-lambda-debug \
--capabilities CAPABILITY_NAMED_IAM
generate-messages.sh
#!/bin/bash
profile=local
region=us-east-1
# -----------------
function genGroupId {
echo $(shuf -i 1-10 -n 1)
}
function genRndStr {
echo $(openssl rand -hex 12)
}
function entry {
echo "{\"Id\":\"$(genRndStr)\",\"MessageBody\":\"$(genRndStr)\",\"MessageGroupId\":\"$(genGroupId)\",\"MessageDeduplicationId\":\"$(genRndStr)\"}"
}
# -----------------
echo "Getting Subscription UUID..."
eventSource=$(aws cloudformation describe-stacks \
--query "Stacks[0].Outputs[?OutputKey=='EventSource'].OutputValue" \
--output text \
--profile $profile \
--region $region \
--stack-name fifo-lambda-debug)
echo "Getting Queue Url..."
queueUrl=$(aws cloudformation describe-stacks \
--query "Stacks[0].Outputs[?OutputKey=='QueueUrl'].OutputValue" \
--output text \
--profile $profile \
--region $region \
--stack-name fifo-lambda-debug)
echo "Disabling Subscription"
aws lambda update-event-source-mapping \
--profile $profile \
--region $region \
--uuid $eventSource \
--no-enabled \
> /dev/null
while : ; do
echo "Waiting until Subscription disabled..."
[[ $(aws lambda get-event-source-mapping \
--profile $profile \
--region $region \
--uuid $eventSource \
--query "State") != '"Disabled"' ]] || break
sleep 10
done
echo "Queueing Messages..."
for i in {1..30}
do
aws sqs send-message-batch \
--profile $profile \
--region $region \
--queue-url "$queueUrl" \
--entries "[$(entry),$(entry),$(entry),$(entry),$(entry),$(entry),$(entry),$(entry),$(entry),$(entry)]" \
> /dev/null
echo "Done: $i / 30"
done
echo "Re-Enabling Subscription..."
aws lambda update-event-source-mapping \
--profile $profile \
--region $region \
--uuid $eventSource \
--enabled \
> /dev/null
remove-stack.sh
#!/bin/bash
profile=local
region=us-east-1
# -----------------
aws cloudformation delete-stack \
--profile $profile \
--region $region \
--stack-name fifo-lambda-debug