Managing EventBridge -> Lambda async retry behaviour


I'm trying to manage Lambda retries in a situation where EventBridge asynchronously invokes a Lambda function via an events rule (see template at bottom).

I've tried to configure retry behaviour on both the EventBridge and Lambda sides, in particular -

  • Event rule max retry attempts set to zero, and dead letter queue configured

  • Lambda event config configured with max retry attempts also set to zero, and Lambda destination queue also configured

I can push a "good" message to EventBridge -

{'action': 'add', 'args': {'x': 2, 'y': 2}}

and this gets picked up by Lambda -

[INFO]  2021-11-19T06:56:25.242Z    590c6514-ad4d-4906-a748-9820af748e76    received: {'version': '0', 'id': '62f363a1-9e0e-a154-8d6a-bce81d22d47f', 'detail-type': 'foobar', 'source': 'whatevs', 'account': '119552584133', 'time': '2021-11-19T06:56:24Z', 'region': 'eu-west-1', 'resources': [], 'detail': {'action': 'add', 'args': {'x': 2, 'y': 2}}}
[INFO]  2021-11-19T06:56:25.242Z    590c6514-ad4d-4906-a748-9820af748e76    result: 4

I can also send a "bad" message to EventBridge -

{'action': 'add', 'args': {'x': 1, 'y': 'a'}}

and this results in a Lambda error -

[INFO]  2021-11-19T06:50:49.603Z    b25129f4-d89a-493c-b85e-7ffaef995c71    received: {'version': '0', 'id': '8bb8b3d2-3725-8a24-19ea-547a6a8b799d', 'detail-type': 'foobar', 'source': 'whatevs', 'account': '119552584133', 'time': '2021-11-19T06:47:53Z', 'region': 'eu-west-1', 'resources': [], 'detail': {'action': 'add', 'args': {'x': 1, 'y': 'x'}}}
[ERROR] TypeError: unsupported operand type(s) for +: 'int' and 'str'Traceback (most recent call last):  File "/var/task/index.py", line 7, in handler    result=args["x"]+args["y"]

So far so good - but the problem is that I still get the standard Lambda retry behaviour at approx T+60 and T+180 seconds, resulting in further errors -

[INFO]  2021-11-19T06:52:46.142Z    897efce2-bb04-45d8-8b3b-4e1e854cdc13    received: {'version': '0', 'id': '56252e23-dbb1-8025-9eda-45cecaa9f04e', 'detail-type': 'foobar', 'source': 'whatevs', 'account': '119552584133', 'time': '2021-11-19T06:52:45Z', 'region': 'eu-west-1', 'resources': [], 'detail': {'action': 'add', 'args': {'x': 1, 'y': 'a'}}}
[ERROR] TypeError: unsupported operand type(s) for +: 'int' and 'str'Traceback (most recent call last):  File "/var/task/index.py", line 7, in handler    result=args["x"]+args["y"]
[INFO]  2021-11-19T06:53:50.326Z    897efce2-bb04-45d8-8b3b-4e1e854cdc13    received: {'version': '0', 'id': '56252e23-dbb1-8025-9eda-45cecaa9f04e', 'detail-type': 'foobar', 'source': 'whatevs', 'account': '119552584133', 'time': '2021-11-19T06:52:45Z', 'region': 'eu-west-1', 'resources': [], 'detail': {'action': 'add', 'args': {'x': 1, 'y': 'a'}}}
[ERROR] TypeError: unsupported operand type(s) for +: 'int' and 'str'Traceback (most recent call last):  File "/var/task/index.py", line 7, in handler    result=args["x"]+args["y"]
[INFO]  2021-11-19T06:55:59.477Z    897efce2-bb04-45d8-8b3b-4e1e854cdc13    received: {'version': '0', 'id': '56252e23-dbb1-8025-9eda-45cecaa9f04e', 'detail-type': 'foobar', 'source': 'whatevs', 'account': '119552584133', 'time': '2021-11-19T06:52:45Z', 'region': 'eu-west-1', 'resources': [], 'detail': {'action': 'add', 'args': {'x': 1, 'y': 'a'}}}
[ERROR] TypeError: unsupported operand type(s) for +: 'int' and 'str'Traceback (most recent call last):  File "/var/task/index.py", line 7, in handler    result=args["x"]+args["y"]

And the offending event never ends up in either the events DLQ or the Lambda destination.

What am I missing here, and what do I need to do to turn off these retries and have the event show up in a DLQ/destination?

(And for good measure, should error handling / retries be configured on the EventBridge side or the Lambda side? Surely I don't need both?)


AWSTemplateFormatVersion: '2010-09-09'
Outputs:
  MyEventBus:
    Value:
      Ref: MyEventBus
  MyEventsDLQ:
    Value:
      Ref: MyEventsDLQ
  MyFunctionDestination:
    Value:
      Ref: MyFunctionDestination
Parameters:
  LambdaHandlerName:
    Default: "index.handler"
    Type: String
  LambdaSize:
    Default: 512
    Type: Number
  LambdaRuntime:
    Default: 'python3.8'
    Type: String
  LambdaTimeout:
    Default: 5
    Type: Number
Resources:
  MyFunction:
    Properties:
      Code:
       ZipFile: |
         import logging
         logger=logging.getLogger()
         logger.setLevel(logging.INFO)
         def handler(event, context):
           logger.info("received: %s" % event)
           args=event["detail"]["args"]
           result=args["x"]+args["y"]
           logger.info("result: %s" % result)
      Handler:
        Ref: LambdaHandlerName
      MemorySize:
        Ref: LambdaSize
      Role:
        Fn::GetAtt:
        - MyFunctionRole
        - Arn
      Runtime:
        Ref: LambdaRuntime
      Timeout:
        Ref: LambdaTimeout
    Type: AWS::Lambda::Function
  MyFunctionRole:
    Properties:
      AssumeRolePolicyDocument:
        Statement:
        - Action: sts:AssumeRole
          Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
        Version: '2012-10-17'
      Policies:
      - PolicyDocument:
          Statement:
          - Action: logs:*
            Effect: Allow
            Resource: '*'
          - Action: sqs:*
            Effect: Allow
            Resource: '*'
          Version: '2012-10-17'
        PolicyName:
          Fn::Sub: my-function-role-policy-${AWS::StackName}
    Type: AWS::IAM::Role
  MyEventsFunctionPermission:
    Properties:
      Action: lambda:InvokeFunction
      FunctionName:
        Ref: MyFunction
      Principal: events.amazonaws.com
      SourceArn:
        Fn::GetAtt:
        - MyEventRule
        - Arn
    Type: AWS::Lambda::Permission
  MyEventRule:
    Properties:
      EventBusName:
        Ref: MyEventBus
      EventPattern:
        detail:
          action:
            - add
      State: ENABLED
      Targets:
      - Arn:
          Fn::GetAtt:
          - MyFunction
          - Arn
        Id:
          Fn::Sub: my-rule-${AWS::StackName}
        RetryPolicy:
          MaximumRetryAttempts: 0
        DeadLetterConfig:
          Arn:
            Fn::GetAtt:
              - MyEventsDLQ
              - Arn
    Type: AWS::Events::Rule
  MyEventBus:
    Properties:
      Name:
        Fn::Sub: my-event-bus-${AWS::StackName}
    Type: AWS::Events::EventBus
  MyEventsDLQ:
    Properties: {}
    Type: AWS::SQS::Queue
  MyEventsDLQPolicy:
    Properties:
      Queues:
        - Ref: MyEventsDLQ
      PolicyDocument:
        Statement:
          - Action: sqs:SendMessage
            Effect: Allow
            Principal:
              Service: events.amazonaws.com
    Type: AWS::SQS::QueuePolicy
  MyFunctionDestination:
    Properties: {}
    Type: AWS::SQS::Queue
  MyFunctionEventConfig:
    Properties:
      DestinationConfig:
        OnFailure:
          Destination:
            Fn::GetAtt:
            - MyFunctionDestination
            - Arn
      FunctionName:
        Ref: MyFunction
      MaximumRetryAttempts: 0
      Qualifier:
        Fn::GetAtt:
        - MyFunctionVersion
        - Version
    Type: AWS::Lambda::EventInvokeConfig
  MyFunctionVersion:
    Properties:
      FunctionName:
        Ref: MyFunction
    Type: AWS::Lambda::Version

Solution

  • Try setting Qualifier: $LATEST on MyFunctionEventConfig.

    As you say, the observed behaviour is consistent with the MyFunctionEventConfig Destination not being called at all. I suspect that is because you have qualified MyFunctionEventConfig with a newly created Lambda version, MyFunctionVersion, but I do not believe you are ever invoking that version: the target of MyEventRule is the unqualified function ARN. So neither MaximumRetryAttempts: 0 nor the Destination ever takes effect for the invocations you are seeing.

    Unless your AWS::Lambda::Version is doing work for you, you can delete it and use Qualifier: $LATEST.
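
    As a minimal, untested sketch of that change, assuming the rest of the template stays exactly as posted, only the Qualifier of MyFunctionEventConfig differs from the original:

    ```yaml
    # Goes under Resources, replacing the existing MyFunctionEventConfig
    MyFunctionEventConfig:
      Properties:
        DestinationConfig:
          OnFailure:
            Destination:
              Fn::GetAtt:
              - MyFunctionDestination
              - Arn
        FunctionName:
          Ref: MyFunction
        MaximumRetryAttempts: 0
        # $LATEST corresponds to the unqualified function ARN that MyEventRule targets
        Qualifier: '$LATEST'
      Type: AWS::Lambda::EventInvokeConfig
    ```

    With this in place nothing references MyFunctionVersion any more, so that resource can be dropped from the template.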

    Edit - Further info:

    Triggers and destinations are version dependent, as each Lambda version has its own ARN.

    You can test this in the Lambda console without redeploying. If the version hypothesis is correct, the destination will not appear in the "Function overview" section of the Lambda console UNLESS you first select the snapshotted version.
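
    If you would rather keep the published version, the alternative is to leave the event config qualified as it is and make the rule invoke that version instead. The fragments below are only a sketch and have not been tested against this stack. They rely on Ref for an AWS::Lambda::Version resource returning the version ARN, and on the invoke permission being granted on the same version:

    ```yaml
    # Fragments under Resources; properties not shown stay as in the question
    MyEventRule:
      Properties:
        Targets:
        - Arn:
            Ref: MyFunctionVersion   # version ARN instead of Fn::GetAtt MyFunction Arn
          # Id, RetryPolicy and DeadLetterConfig unchanged
    MyEventsFunctionPermission:
      Properties:
        FunctionName:
          Ref: MyFunctionVersion     # grant events.amazonaws.com invoke on that version
    ```

    One caveat with this route: as far as I know, an AWS::Lambda::Version resource does not publish a new version by itself when the function code changes, so the rule may keep invoking an old snapshot. The $LATEST approach avoids that.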