Using the Node SDK, I am seeing high delivery latency of about 9000 ms for SQS messages when they are sent in parallel across multiple queues. If I send all the messages sequentially on one queue, latency is much lower, about 300 ms. In both tests, messages are sent in batches of 10 using the code below. The send timestamp is embedded in the message body itself, so I can measure delivery time when the message is received.
Why is it so much slower to transfer messages sent in parallel across multiple queues? Does Amazon do rate limiting?
var entries = [];
for (var i = 0; i < 10; i++) {
    var entry = {
        Id: String(messageCount),
        // Embed the send timestamp in the body so the receiver can
        // compute delivery latency.
        MessageBody: "test" + messageCount + (new Date().getTime()),
        MessageDeduplicationId: "test" + messageCount + " " + (new Date().getTime())
    };
    entries.push(entry);
    messageCount += 1;
}
var params = {
    QueueUrl: queueUrls[senderIndex],
    Entries: entries
};
sqs.sendMessageBatch(params, function(batchSendErr, results) {
    ...
});
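The receive-side measurement described above can be sketched as follows. `deliveryLatencyMs` is a hypothetical helper name, and it assumes the epoch timestamp appended by the send loop occupies the final 13 digits of the message body:

```javascript
// Recover the send timestamp from a body of the form
// "test" + messageCount + Date.now() and compute delivery latency.
// Assumes Date.now() is 13 digits (true for current epoch-ms values).
function deliveryLatencyMs(messageBody, receivedAtMs) {
  var sentAtMs = parseInt(messageBody.slice(-13), 10);
  return receivedAtMs - sentAtMs;
}

// Example: a message built at t=1700000000000 and received 300 ms later.
var body = "test" + 42 + 1700000000000;
console.log(deliveryLatencyMs(body, 1700000000300)); // 300
```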
In the sequential case, I run one node program that sends 500 batches of 10 messages on a single queue, and one program that receives 500 batches of 10 messages. (5000 messages total.)
In the parallel case, I run one node program that creates 250 writer threads and sends two batches of 10 messages from each to one of 250 different queues. I run another node program that creates 250 reader threads and receives two batches of 10 messages from each. (5000 messages total.)
For the tests above, I use FIFO queues, although results are similar with non-FIFO queues.
I do notice that in the parallel send case, each batch send call to the AWS SDK takes about 5 seconds to get a completion callback, while in the sequential case each batch send call takes about 300 ms. I'm not sure why the API calls are slower in parallel unless AWS is rate-limiting my calls.
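The per-call timing above can be captured with a small wrapper like this. This is a sketch: `timedSendMessageBatch` is a hypothetical name, and `sqs` is assumed to be the aws-sdk v2 client used in the snippet earlier:

```javascript
// Wrap sendMessageBatch so each call logs its wall-clock duration
// before handing the result to the original callback.
function timedSendMessageBatch(sqs, params, done) {
  var startedAt = Date.now();
  sqs.sendMessageBatch(params, function (err, results) {
    console.log("sendMessageBatch took " + (Date.now() - startedAt) + " ms");
    done(err, results);
  });
}
```

Comparing the logged durations between the sequential and parallel runs is how the ~300 ms vs. ~5 s per-call difference shows up.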
BTW, here is what my queue setup looks like. There are 250 of these queues, with #{item} ranging from 1 to 250:
aws sqs create-queue --queue-name loadtest_device#{item}_user#{item}.fifo --attributes "FifoQueue=true,VisibilityTimeout=300,ReceiveMessageWaitTimeSeconds=0"
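For illustration, the #{item} template above expands to names like the following; `queueNames` is a hypothetical helper that just reproduces the naming pattern:

```javascript
// Generate the FIFO queue names produced by the create-queue template,
// with device and user indices moving together.
function queueNames(count) {
  var names = [];
  for (var item = 1; item <= count; item++) {
    names.push("loadtest_device" + item + "_user" + item + ".fifo");
  }
  return names;
}

console.log(queueNames(250)[0]);     // "loadtest_device1_user1.fifo"
console.log(queueNames(250).length); // 250
```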
After discussing this issue with an Amazon support engineer, he reported that he could not reproduce the results. His theory was that the latency problem was caused by a network bottleneck, specifically Wi-Fi congestion, since both test machines were using the same Wi-Fi radio space.
I was never able to fully confirm this, as I did not have access to two independent wired network connections. I did observe that SQS send operations have a large, verbose data payload and use significantly more bandwidth than other operations (about 2x the data overhead of Lambda calls), which is consistent with the network-bottleneck theory.