We are currently implementing a distributed Spring Boot microservice architecture on Amazon's AWS, where we use SNS/SQS as our messaging system:
Events are published by a Spring Boot service to an SNS FIFO topic using Spring Cloud AWS. The topic hands over the events to multiple SQS queues subscribed to the topic, and the queues are then in turn consumed by different consumer services (again Spring Boot using Spring Cloud AWS).
Everything works as intended, but we are sometimes seeing very high latency on our production services.
Our product isn't released yet (we are currently in testing), meaning we have very, very low traffic on prod, i.e., only a few messages a day.
Unfortunately, we see very high latency until a message is delivered to its subscribers after a long period of inactivity (typically up to 6 seconds, but can be as high as 60 seconds). Things speed up considerably afterwards with message delivery times dropping to below 100ms for the next messages being sent to the topic.
Turning on logging on the SNS topic in AWS revealed that most of the delay for the first message is spent at the SNS part of things, where the SNS dwellTime
roughly correlates with the delays we are seeing in message delivery. Spring Cloud AWS seems fine.
Is this something expected? Is there something like a "cold startup" time for idle SNS FIFO topics (as seen when using AWS lambdas)? Will this latency simply go away once we increase the load and heat up the topic? Or is there something we missed configuring?
We are using fairly standard SQS subscriptions, btw, no subscription throttling in place. The Spring Boot services run on a Fargate ECS cluster.
Seems like AWS inactivates unused SNS topics somehow. What we are doing now is, we are sending a "dummy" Keep-Alive message to the topic every ten minutes, which keeps the dwellTime
reasonably low for us (<500ms)