The setup is pretty simple:
backend service sends a message to SQS -> SQS triggers lambda -> lambda code is executed
I added a log right after the backend service sent the message to SQS.
Then I viewed the logs for the lambda using CloudWatch.
To my surprise there is a gap of around 10 seconds between the backend service log and the first lambda log, which means that it took around 10s for SQS to receive the message and hand it over to the lambda.
Question
Is it normal that SQS takes around 10s to process an incoming message? If yes, can it be made faster?
UPDATE
Thanks to the comment I was able to track this down a bit more. In my case we use the TypeScript CDK and the lambda is created with the following SqsEventSource:
import cdk = require("@aws-cdk/core")
import sqs = require("@aws-cdk/aws-sqs")
import eventSources = require("@aws-cdk/aws-lambda-event-sources")

const eventSource = new eventSources.SqsEventSource(
  sqs.Queue.fromQueueAttributes(scope, "SomeId", {
    queueArn: "someArn",
    queueName: "someQueueName",
  }),
  {
    batchSize: 100, // invoke with up to 100 messages per batch
    maxBatchingWindow: cdk.Duration.seconds(5), // wait at most 5s to fill a batch
  },
)
As far as I understand, batchSize: 100 and maxBatchingWindow: cdk.Duration.seconds(5) together essentially mean: "Execute the lambda immediately if there are 100 events, or at the latest after 5s."
So 10s is still 5s too much (it sometimes even takes 15-20s).
Is this a bug?
UPDATE 2
To make the whole process more testable I decided to send the message directly to the queue via the CLI (so the backend service is not relevant from here on anymore):
aws sqs send-message --endpoint-url <endpoint> --queue-url <queue-url> --message-body '{"test":"test1"}'
I executed the above command 10 times in a row (it took at most 7s to run all of them) and then checked the lambda logs in CloudWatch:
(I could check how many messages were processed by each invocation via application logs, which are not visible here.)
So as you can see there is a gap of 16s between the first and second lambda execution (the rest looks fine). Since I sent all 10 SQS messages within at most 7 seconds, that should never be possible. There should be a gap of at most 5 seconds, or at worst 10s.
What could be the cause here? Is this a bug?
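For reference, the same test can also be scripted instead of running the CLI command by hand. Here is a minimal sketch using the AWS SDK for JavaScript v3 (the endpoint, queue URL and message bodies are placeholders, not my actual values):

import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs"

// Placeholders: point this at the same endpoint/queue as the CLI command above.
const client = new SQSClient({ endpoint: "<endpoint>" })
const queueUrl = "<queue-url>"

async function sendTestMessages(count: number): Promise<void> {
  for (let i = 0; i < count; i++) {
    await client.send(
      new SendMessageCommand({
        QueueUrl: queueUrl,
        MessageBody: JSON.stringify({ test: `test${i}` }),
      }),
    )
    // Log the send time so it can be compared with the lambda logs in CloudWatch.
    console.log(`sent message ${i} at ${new Date().toISOString()}`)
  }
}

sendTestMessages(10).catch(console.error)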
Side note:
I actually came across this because a system test of my application was failing because it was so slow. In the system test it is slow about 50% of the time; the other 50% it works as expected.
I was finally able to identify the issue in my case, although I could not figure out why there is a 16s difference as seen in my question.
The general problem is that if you execute a lambda function with a batch of SQS messages and one of those messages can't be processed (i.e. the lambda throws an error), all SQS messages of that batch are retried with a delay of a few seconds.
So if there is a valid SQS message in the same batch as an invalid/failing message, it might not get processed at all, because the failing message makes the whole batch fail and be retried. That can happen multiple times in a row, which led to delays of more than 1 minute in my case.
Unfortunately there is no way to tell AWS that some messages of a batch are valid and others aren't. If the lambda throws an error, all messages in the batch are always retried.
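To illustrate the problem, here is a minimal sketch of a handler that behaves like this (the handler and processMessage are hypothetical names, not my actual code):

import { SQSEvent } from "aws-lambda"

// Hypothetical per-message logic; throws for messages it cannot handle.
async function processMessage(body: string): Promise<void> {
  if (!JSON.parse(body).test) {
    throw new Error("invalid message")
  }
}

// Naive handler: if processMessage throws for any record, the whole invocation
// fails and SQS redelivers ALL records of the batch after the visibility
// timeout, including those that were already processed successfully.
export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    await processMessage(record.body)
  }
}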
There are some workarounds, and I chose to surround the processing of each SQS message with a try/catch (sketched below) and only throw an error after all messages have been processed, and only if at least one of them failed. This still means that successful messages are retried, but at least they are processed in time (note that your lambda needs to be idempotent for this approach). I would not really recommend this workaround; it was just a hotfix with a TODO comment :)
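A minimal sketch of that workaround (processMessage is the same hypothetical per-message logic as above):

import { SQSEvent } from "aws-lambda"

// Same hypothetical per-message logic as in the sketch above.
declare function processMessage(body: string): Promise<void>

// Workaround: attempt every record, collect failures, and only throw at the
// very end. On error SQS still redelivers the whole batch, so processMessage
// must be idempotent.
export const handler = async (event: SQSEvent): Promise<void> => {
  const errors: Error[] = []

  for (const record of event.Records) {
    try {
      await processMessage(record.body)
    } catch (err) {
      errors.push(err as Error)
    }
  }

  // TODO: replace this hotfix with a proper solution.
  if (errors.length > 0) {
    throw new Error(`${errors.length} of ${event.Records.length} messages failed`)
  }
}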