Search code examples
amazon-web-servicesaws-lambdaamazon-sqsthrottling

AWS Fifo Halts on Lambda Throttle


We use a Fifo Queue configured with a Lambda Function as a processor. We use MessageGoupId and BatchSize to optimistically remove redundant messages. To rate limit processing we use reserved concurrency. Our function timeout has to be high. The queue maximum receives is set to ten.

Observed

When there are a lot of messages in the queue, the Lambda function scales up. Once it scales enough to throttle, the queue processing comes to a complete hold and no further messages are processed until some minutes have passed.

I’m assuming this is because of throttling, because halting always coincides with throttling and when increasing the reserved concurrency, it takes much longer for the processing to come to a halt.

I’m assuming the time until the queue starts up again has to do with the lambda retry limit, the function timeout and the queue visibility timeout. But since I don’t know exactly what happens this is a guess.

Issue

There are no errors logged and ultimately all messages are processed, but since processing is time and throughput sensitive, it is not acceptable to have the queue halt for many minutes.

Questions

What is happening and how can we solve this? If more information is needed I’d be very happy to debug further.


Edit: Found this: To allow your function time to process each batch of records, set the source queue's visibility timeout to at least 6 times the timeout that you configure on your function. The extra time allows for Lambda to retry if your function execution is throttled while your function is processing a previous batch. and we are certainly in violation, but I'm not sure how / if this explains the observed behaviour.


Solution

  • How to Reproduce

    At the end of this answer there is a complete, minimal example that is written to easily reproduce the issue.

    To deploy, create all files and fill in your aws profile and desired region into all sh files.

    Then run

    . deploy-stack.sh

    to create the cloudformation stack that contains all necessary resources.

    Then open the AWS web interface (SQS) and run

    . generate-messages.sh

    to generate messages on the queue.

    One can then see that about half of the messages are processed before the function throttles and the queue comes to a complete halt.

    To remove the cloudformation stack after all debugging has been completed run remove-stack.sh

    Solution

    AWS documentation contains a note saying

    To allow your function time to process each batch of records, set the source queue's visibility timeout to at least 6 times the timeout that you configure on your function. The extra time allows for Lambda to retry if your function execution is throttled while your function is processing a previous batch.

    Changing the timeout on the lambda function from 600 to 100 and redeploying the stack allows all messages to process correctly even though the lambda function throttles.

    I can not explain why this behaviour is observed and would greatly appreciate feedback on that. However the above does solve the issue described.

    Files

    stack.yaml

    AWSTemplateFormatVersion: '2010-09-09'
    Transform: AWS::Serverless-2016-10-31
    Description: Debug Stack for Fifo with Lambda Processor
    Resources:
      MyLambdaRole:
        Type: AWS::IAM::Role
        Properties:
          RoleName:
            Fn::Sub: lambda-role
          AssumeRolePolicyDocument:
            Version: 2012-10-17
            Statement:
              - Action:
                  - sts:AssumeRole
                Effect: Allow
                Principal:
                  Service:
                    - lambda.amazonaws.com
          ManagedPolicyArns:
            - arn:aws:iam::aws:policy/AWSLambdaExecute
            - arn:aws:iam::aws:policy/AmazonSqsFullAccess
          Path: /
      MySqsQueue:
          Type: 'AWS::SQS::Queue'
          Properties:
            FifoQueue: true
            VisibilityTimeout: 600
      MySQSQueueFunction:
        Type: AWS::Lambda::Function
        Properties:
          Handler: index.handler
          Role: !GetAtt MyLambdaRole.Arn
          Runtime: nodejs12.x
          Timeout: 600
          ReservedConcurrentExecutions: 5
          Code:
            ZipFile: |
              exports.handler = (event, context) => new Promise((resolve) => {
                setTimeout(resolve, 1000);
              });
      MySQSLambdaEventSource:
        Type: AWS::Lambda::EventSourceMapping
        Properties:
          BatchSize: 1
          Enabled: false
          EventSourceArn: !GetAtt MySqsQueue.Arn
          FunctionName: !Ref MySQSQueueFunction
    Outputs:
      QueueUrl:
        Value:
          Ref: MySqsQueue
      EventSource:
        Value:
          Ref: MySQSLambdaEventSource
    

    deploy-stack.sh

    #!/bin/bash
    
    profile=local
    region=us-east-1
    
    # -----------------
    
    aws cloudformation deploy \
    --profile $profile \
    --region $region \
    --template-file stack.yaml \
    --stack-name fifo-lambda-debug \
    --capabilities CAPABILITY_NAMED_IAM
    

    generate-messages.sh

    #!/bin/bash
    
    profile=local
    region=us-east-1
    
    # -----------------
    
    function genGroupId {
      echo $(shuf -i 1-10 -n 1)
    }
    function genRndStr {
      echo $(openssl rand -hex 12)
    }
    function entry {
      echo "{\"Id\":\"$(genRndStr)\",\"MessageBody\":\"$(genRndStr)\",\"MessageGroupId\":\"$(genGroupId)\",\"MessageDeduplicationId\":\"$(genRndStr)\"}"
    }
    
    # -----------------
    
    echo "Getting Subscription UUID..."
    eventSource=$(aws cloudformation describe-stacks \
    --query "Stacks[0].Outputs[?OutputKey=='EventSource'].OutputValue" \
    --output text \
    --profile $profile \
    --region $region \
    --stack-name fifo-lambda-debug)
    
    echo "Getting Queue Url..."
    queueUrl=$(aws cloudformation describe-stacks \
    --query "Stacks[0].Outputs[?OutputKey=='QueueUrl'].OutputValue" \
    --output text \
    --profile $profile \
    --region $region \
    --stack-name fifo-lambda-debug)
    
    echo "Disabling Subscription"
    aws lambda update-event-source-mapping \
    --profile $profile \
    --region $region \
    --uuid $eventSource \
    --no-enabled \
    > /dev/null
    
    while : ; do
        echo "Waiting until Subscription disabled..."
        [[ $(aws lambda get-event-source-mapping \
          --profile $profile \
          --region $region \
          --uuid $eventSource \
          --query "State") != '"Disabled"' ]] || break
        sleep 10
    done
    
    echo "Queueing Messages..."
    for i in {1..30}
    do
      aws sqs send-message-batch \
      --profile $profile \
      --region $region \
      --queue-url "$queueUrl" \
      --entries "[$(entry),$(entry),$(entry),$(entry),$(entry),$(entry),$(entry),$(entry),$(entry),$(entry)]" \
      > /dev/null
      echo "Done: $i / 30"
    done
    
    echo "Re-Enabling Subscription..."
    aws lambda update-event-source-mapping \
    --profile $profile \
    --region $region \
    --uuid $eventSource \
    --enabled \
    > /dev/null
    

    remove-stack.sh

    #!/bin/bash
    
    profile=local
    region=us-east-1
    
    # -----------------
    
    aws cloudformation delete-stack \
    --profile $profile \
    --region $region \
    --stack-name fifo-lambda-debug