I have an AWS Lambda function that is triggered by an SQS queue. The Lambda timeout is 15 minutes and the queue's visibility timeout is 16 minutes. When the Lambda is about to time out, it puts a continuation key into the same queue and ends the current invocation. A new invocation then resumes from where the previous one left off, using the SQS message. Everything worked fine until the Lambda put a message into SQS for the 16th time (the 16th execution).
SQS shows the message as in flight, but the Lambda does not pick it up. After 10 retries the message is moved to the DLQ. There is no concurrency issue. Why does my Lambda stop picking up the SQS message after 15 executions?
I have been using Lambda and SQS for more than 4 years but have never faced anything like this. I am not sure what I am missing.
Edit: The message retention period in SQS is 4 days. To reproduce the issue, create an SQS queue named test-sqs with a visibility timeout of 30 seconds and add it as a trigger to a new Lambda (Lambda timeout of 20 seconds). Also attach a DLQ to the queue. The following is the Lambda code.
import json
import boto3
import time
from datetime import datetime

sqsClient = boto3.client('sqs')
SQS_URL = "https://sqs.ap-south-1.amazonaws.com/YOUR_ACCOUNT_NUMBER/test-sqs"

def lambda_handler(event, context):
    if ("Records" in event) and (len(event["Records"]) > 0):
        print("Trigger through SQS.")
        for record in event["Records"]:
            event = json.loads(record["body"])
    else:
        print("Triggered manually.")
    print(event)
    start_time = datetime.utcnow()
    print("start", start_time)
    time.sleep(1)
    segment_number = 1
    if "segment_number" in event:
        segment_number = event["segment_number"]
    if segment_number <= 20:
        segment_number += 1
        payload = {
            "segment_number": segment_number
        }
        sqsClient.send_message(QueueUrl=SQS_URL, MessageBody=json.dumps(payload))
    else:
        print("COMPLETED")
    print("end", datetime.utcnow())
Lambda and SQS have a recursive loop detection mechanism. If it detects that messages are going around in an infinite loop, it stops the invocation. As of now, it stops on the 16th execution, which is why your Lambda only runs 15 times.
Once the invocation is stopped, the message uses up its max receive count and is moved to the DLQ. Lambda also emits the RecursiveInvocationsDropped CloudWatch metric.
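To confirm that this is what happened, you can look up the RecursiveInvocationsDropped metric for your function. Below is a minimal sketch using boto3. The helper names build_metric_query and dropped_invocations are made up for illustration, and I am assuming the metric is published in the AWS/Lambda namespace with a FunctionName dimension, so adjust it if your account shows it differently.

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(function_name, hours=24, now=None):
    """Build get_metric_statistics arguments for RecursiveInvocationsDropped.

    Assumption: the metric lives in AWS/Lambda with a FunctionName dimension.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": "RecursiveInvocationsDropped",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 3600,  # one datapoint per hour
        "Statistics": ["Sum"],
    }


def dropped_invocations(function_name, hours=24):
    """Return the total number of dropped recursive invocations in the window."""
    import boto3  # imported lazily so build_metric_query works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **build_metric_query(function_name, hours)
    )
    return sum(point["Sum"] for point in response["Datapoints"])
```

Calling dropped_invocations("your-function-name") with credentials for the right region should return a non-zero value if Lambda dropped invocations because of loop detection during that window.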
More about it can be found in the article here: https://aws.amazon.com/blogs/compute/detecting-and-stopping-recursive-loops-in-aws-lambda-functions/
It explains: