amazon-web-services, events, queue, amazon-sqs

How to handle Dead Letter Queues in Amazon SQS?


I am using event-driven architecture for one of my projects. Amazon Simple Queue Service supports handling failures.

If a message is not handled successfully, processing never reaches the part where I delete the message from the queue. If it's a one-time failure, it is handled gracefully. However, if it is an erroneous message, it makes its way into the DLQ.

My question is: what should happen to the messages in a DLQ later on? There are thousands of them stuck in the DLQ. How are they supposed to be handled?

I would love to hear some real-life examples and engineering processes that are in place in some of the organizations.


Solution

  • "It depends!"

    Messages would have been sent to the Dead Letter Queue because something didn't happen as expected. It might be due to a data problem, a timeout or a coding error.

    You should:

    • Start examining the messages that went to the Dead Letter Queue
    • Try to reprocess the messages to determine the underlying cause of the failure (but sometimes it is a random failure that you cannot reproduce)
    • Once a cause is found, update the system to handle that particular case, then move on to the next cause
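
    The steps above could be sketched roughly as follows. This is only an illustration, not a definitive process: `triage_dlq`, `handler`, and `dlq_url` are hypothetical names, and `sqs` is assumed to be a boto3 SQS client (for example `boto3.client("sqs")`). The sketch re-runs your own handler against messages in the DLQ, deletes those that now succeed, and tallies the rest by exception type so the most common cause can be tackled first.

```python
from collections import Counter


def triage_dlq(sqs, dlq_url, handler, max_batches=10):
    """Re-run `handler` on DLQ messages and tally failures by exception type.

    `sqs` is assumed to be a boto3 SQS client; `handler` is your own
    processing function (hypothetical here) that takes the message body.
    """
    failures = Counter()
    recovered = 0
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
            WaitTimeSeconds=1,
            VisibilityTimeout=30,  # failed messages reappear after this delay
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            try:
                handler(msg["Body"])
            except Exception as exc:
                # Leave the message in the DLQ and record why it failed.
                failures[type(exc).__name__] += 1
            else:
                # Processed successfully this time, so remove it from the DLQ.
                sqs.delete_message(
                    QueueUrl=dlq_url, ReceiptHandle=msg["ReceiptHandle"]
                )
                recovered += 1
    return recovered, failures
```

    Messages that fail again are not deleted, so they become visible in the DLQ once the visibility timeout expires. If a run takes longer than that timeout, the same message may be counted more than once.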

    Common causes can be database locks, network errors, programming errors and corrupt data.

    It's probably a good idea to set up some sort of monitoring so that somebody investigates quickly, rather than letting the queue accumulate thousands of messages.
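
    One possible form of such monitoring is a CloudWatch alarm on the DLQ's `ApproximateNumberOfMessagesVisible` metric. The sketch below is an assumption about how you might wire it up, not the only way: `dlq_alarm_params` and `create_dlq_alarm` are hypothetical helpers, `cloudwatch` is assumed to be a boto3 CloudWatch client (`boto3.client("cloudwatch")`), and the queue name and SNS topic ARN are placeholders you would supply.

```python
def dlq_alarm_params(queue_name, sns_topic_arn, threshold=0):
    """Build put_metric_alarm arguments that fire when the DLQ is non-empty."""
    return {
        "AlarmName": "%s-has-messages" % queue_name,
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,  # evaluate over 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],  # e.g. a topic that notifies on-call
    }


def create_dlq_alarm(cloudwatch, queue_name, sns_topic_arn, threshold=0):
    """Create (or update) the alarm using a boto3 CloudWatch client."""
    params = dlq_alarm_params(queue_name, sns_topic_arn, threshold)
    cloudwatch.put_metric_alarm(**params)
    return params["AlarmName"]
```

    With the default threshold of 0, the alarm triggers as soon as a single message lands in the DLQ. A higher threshold may suit systems where occasional failures are expected.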