Rebus retry policy when RabbitMQ is temporarily down

I have a dockerized microservice architecture where I am using Rebus with RabbitMQ as the message bus.

One container is running RabbitMQ. Other containers are running services that communicate with each other via Rebus/RabbitMQ.

I want my solution to be resilient to container restarts. If, for example, the RabbitMQ container restarts, I expect the other services to be unaffected. I expect that messages sent while RabbitMQ is down are queued up for delivery by Rebus in the sending service, and that they are delivered when the RabbitMQ connection is restored.

To verify that, I run this test scenario:

Service A sends a message to service B via Rebus and RabbitMQ. That works fine.
I stop the RabbitMQ container.
Service A sends a message to service B via Rebus and RabbitMQ. That fails because RabbitMQ is unavailable.
I start the RabbitMQ container again.
I can see that Rebus in my services automatically reconnects to RabbitMQ when it is up again. That is as expected.
Now that the RabbitMQ connection is restored I would expect that Rebus sends the pending message from Service A to service B, but it does not.

Is this not expected behaviour of Rebus? If not, can I enable this feature?

I have read this topic https://github.com/rebus-org/Rebus/wiki/Automatic-retries-and-error-handling and tried to configure Rebus like this:

Configure.With(...)
    .Options(b => b.SimpleRetryStrategy(maxDeliveryAttempts: 10))
    .(...)

but with no luck.

Solution

The "delivery attempts" setting you're configuring controls how many times Rebus should try to consume a received message before giving up (i.e. moving it to the error queue). It does not apply to outgoing messages.

If Rebus loses its connection to the broker, it will not be able to receive anything for the entire duration of the outage, so stopping RabbitMQ should effectively pause all message processing (possibly with some exceptions in all messages being handled at the instant where RabbitMQ goes away).

Since no Rebus handlers will be running while RabbitMQ is down, the outgoing messages you have to deal with are those sent from other places, e.g. messages sent/published from a web request.

(...) I expect that messages sent while RabbitMQ is down are queued up for delivery by Rebus (...)

...but Rebus cannot queue anything up, because RabbitMQ is down(*).

The natural thing to do for Rebus in this situation is to give you, the caller, the responsibility of deciding what to do about the problem.

In .NET, that is usually done by throwing an exception back at you. 🙂

This leaves you with the option of

performing some alternative action, or
retrying some more times, or
whatever makes sense in that particular situation

A simple approach to building some resilience into your system in this case would be to use something like Polly to try sending outgoing messages multiple times in cases where it could fail.
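
As a rough illustration of the Polly suggestion, here is a minimal sketch of wrapping a send operation in a retry policy. It assumes Polly's classic (v7-style) policy API and an already configured Rebus IBus instance. The names ResilientSender and SendWithRetry are hypothetical, the retry count and back-off times are arbitrary, and handling every Exception is a simplification; you would probably want to narrow that to the exceptions you actually see when the broker is unavailable.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Rebus.Bus;

// Hypothetical helper, not part of Rebus or Polly.
public static class ResilientSender
{
    // Retry up to 5 times, waiting 2, 4, 8, 16 and 32 seconds between attempts.
    // Handling all exceptions is an assumption made to keep the sketch short.
    static readonly IAsyncPolicy RetryPolicy = Policy
        .Handle<Exception>()
        .WaitAndRetryAsync(5, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    // Sends the message through the bus, retrying if the send throws.
    public static Task SendWithRetry(IBus bus, object message) =>
        RetryPolicy.ExecuteAsync(() => bus.Send(message));
}
```

A caller would then use await ResilientSender.SendWithRetry(bus, message) instead of await bus.Send(message). If every attempt fails, the last exception is still thrown back to the caller, so the decision about what to do next remains with you.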

I hope that makes sense. Please let me know if anything needs to be elaborated on. 🙂

(*) Of course Rebus could have "cheated" and queued outgoing messages up in memory, but that would make it very hard for you to write resilient code, because you would not know whether an outgoing message had been safely delivered to the broker, or whether it was just sitting in memory waiting to be saved somewhere.