I am using an event-driven architecture for one of my projects. Amazon Simple Queue Service (SQS) supports failure handling through dead-letter queues (DLQs).
If a message is not handled successfully, my code never reaches the part where it deletes the message from the queue. If it's a one-time failure, it is handled gracefully on retry. However, if the message itself is erroneous, it makes its way into the DLQ.
My question is: what should happen to the messages in a DLQ afterwards? There are thousands of them stuck in mine. How are they supposed to be handled?
I would love to hear some real-life examples and the engineering processes that some organizations have in place.
"It depends!"
Messages would have been sent to the Dead Letter Queue because something didn't happen as expected. It might be due to a data problem, a timeout or a coding error.
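For context, here is a minimal sketch of how a message typically ends up there, assuming a boto3-style SQS client. The `handle` function and the `order_id` field are made up for illustration; only `receive_message` and `delete_message` are real SQS client calls.

```python
import json


def handle(body: dict) -> None:
    """Hypothetical business logic: rejects messages without an order_id."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def poll_once(sqs, queue_url: str) -> int:
    """Process one batch and delete only the messages that were handled.

    `sqs` is assumed to behave like boto3.client("sqs").
    Returns the number of messages deleted.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling
    )
    deleted = 0
    for msg in resp.get("Messages", []):
        try:
            handle(json.loads(msg["Body"]))
        except Exception:
            # No delete here: the message becomes visible again after the
            # visibility timeout and is retried. Once its receive count
            # exceeds the maxReceiveCount in the queue's redrive policy,
            # SQS moves it to the DLQ.
            continue
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        deleted += 1
    return deleted
```

A transient failure succeeds on a later receive and is deleted then. A message that can never be processed keeps failing until it is moved to the DLQ.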
You should:
Common causes include database locks, network errors, programming errors, and corrupt data.
It's probably a good idea to set up some sort of monitoring so that somebody investigates quickly, rather than letting the messages accumulate into the thousands.
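One way to do that is a CloudWatch alarm that fires as soon as the DLQ is non-empty. The following is only a sketch: the alarm name, the 5-minute period, and the SNS topic used for notification are assumptions you would adapt. The metric `ApproximateNumberOfMessagesVisible` in the `AWS/SQS` namespace and the `put_metric_alarm` call are standard.

```python
def build_dlq_alarm(dlq_name: str, sns_topic_arn: str) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm triggers when the DLQ holds at least one visible message.
    """
    return {
        "AlarmName": f"{dlq_name}-not-empty",  # assumed naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 300,  # seconds; assumed evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],  # e.g. a topic that emails or pages the team
    }


def create_dlq_alarm(dlq_name: str, sns_topic_arn: str) -> None:
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so build_dlq_alarm works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_dlq_alarm(dlq_name, sns_topic_arn))
```

Alarming on "greater than zero" is deliberately strict: the first dead-lettered message gets someone's attention, while the failure is still recent and easy to trace.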