I am utilising spring cloud aws messaging (2.0.1.RELEASE
) in java to consume from an SQS queue. If it's relevant we use default settings, java 10 and spring cloud Finchley.SR2
,
We recently had an issue where a message could not be processed due to an application bug, leading to an exception and no confirmation (deletion) of the message. The message is later retried (this is desirable) presumably after the visibility timeout has elapsed (again default values are in use), we have not customised the settings here.
We didn't spot the error above for a few days, meaning the message receive count was very high and the message had conceptually been on the queue for a while (several days by now). We considered creating a cloud watch SQS alarm to alert us to a similar situation in future. The only suitable metric appeared to be ApproximateAgeOfOldestMessage
.
Sadly, when observing this metric I see this:
The max age doesn't go much above 5 mins (despite me knowing it was several days old). If a message is getting older each time a receive happens, assuming no acknowledgment comes and the message isn't deleted - but is instead becoming available again after the visibility timeout has elapsed should this graph not be much much higher?
I don't know if this is something specific to thew way that spring cloud aws messaging consumes the message or whether it's a general SQS quirk, but my expectation was that if a message was put on the queue 5 days ago, and a consumer had not successfully consumed the message then the max age would be 5 days?
Is it in fact the case that if a message is received by a consumer, but not ultimately deleted that the max age is actually the length between consume calls?
Can anyone confirm whether my expectation is incorrect, i.e. this is indeed how SQS is expected to behave (it doesn't consider the age to be the duration of time since the message was first put on the queue, but instead considers it to be the time between receive calls?
Based on a similar question on AWS forums, this is apparently a bug with regular SQS queues where only a single message is affected.
In order to have a useful alarm for this issue, I would suggest setting up a dead-letter-queue (where messages get automatically delivered after a configurable number of consume-without-deletes), and alarm on the size of the dead-letter-queue (ApproximateNumberOfMessagesVisible).