
RabbitMQ - deal with unreliable service


I have a service AAA that posts 10 to 50 thousand messages a minute to a RabbitMQ exchange. A .NET Core service BBB subscribes to a queue (to which all messages are routed) and for each message calls another HTTP service CCC over the Internet. The problem is that CCC is very unreliable: a few times a day it goes completely offline for a minute or two, and at least once a week it dies for an hour.

I don't have control over AAA or CCC. How can I use RabbitMQ routing features to reliably deliver all messages through?


Solution

  • For an unreliable third-party service CCC which goes offline for minutes or hours, a circuit-breaker can be useful. Configure the circuit-breaker to break when it detects CCC is offline.

    You can monitor the circuit-breaker state to detect when CCC is offline and/or log changes of circuit-state for later analysis.

    Polly's circuit-breaker allows you to hook in any custom code on transitions of circuit state, so you could also:

    • when the circuit breaks, unsubscribe from the RabbitMQ queue.
    • when the circuit half-opens, resubscribe to the RabbitMQ queue at narrow parallelism (say, with a pre-fetch count of only 1 or 2 ... only enough messages for the circuit-breaker to retry the circuit).
    • when the circuit closes (healthy again), resubscribe to the RabbitMQ queue at full throughput.
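
    The three transitions above can be sketched with Polly's circuit-breaker hooks and the RabbitMQ .NET client. This is a minimal sketch, not a drop-in implementation: the queue name `ccc-work-queue`, the thresholds, the prefetch counts, and the `CallCccAsync` helper are all illustrative assumptions.

    ```csharp
    using System;
    using System.Net.Http;
    using System.Threading.Tasks;
    using Polly;
    using RabbitMQ.Client;
    using RabbitMQ.Client.Events;

    class CccConsumerController
    {
        private readonly IModel _channel;   // an already-open RabbitMQ channel (assumed)
        private readonly IAsyncPolicy _breaker;
        private string _consumerTag;

        public CccConsumerController(IModel channel)
        {
            _channel = channel;
            _breaker = Policy
                .Handle<HttpRequestException>()
                .CircuitBreakerAsync(
                    exceptionsAllowedBeforeBreaking: 5,       // illustrative threshold
                    durationOfBreak: TimeSpan.FromSeconds(30),
                    // CCC looks dead: stop consuming entirely.
                    onBreak: (ex, breakDelay) => Resubscribe(prefetch: 0, consume: false),
                    // Healthy again: resubscribe at full throughput.
                    onReset: () => Resubscribe(prefetch: 50, consume: true),
                    // Probing: resubscribe at a trickle so only a message or
                    // two is in flight while the breaker retests the circuit.
                    onHalfOpen: () => Resubscribe(prefetch: 1, consume: true));
        }

        public void Start() => Resubscribe(prefetch: 50, consume: true);

        private void Resubscribe(ushort prefetch, bool consume)
        {
            if (_consumerTag != null)
            {
                _channel.BasicCancel(_consumerTag);           // unsubscribe
                _consumerTag = null;
            }
            if (!consume) return;

            // Limit unacked deliveries to this consumer.
            _channel.BasicQos(prefetchSize: 0, prefetchCount: prefetch, global: false);

            var consumer = new EventingBasicConsumer(_channel);
            consumer.Received += async (sender, ea) =>
            {
                try
                {
                    await _breaker.ExecuteAsync(() => CallCccAsync(ea.Body.ToArray()));
                    _channel.BasicAck(ea.DeliveryTag, multiple: false);
                }
                catch
                {
                    // Failed (circuit open, or CCC errored): return the
                    // message to the queue; see the error-queue discussion below.
                    _channel.BasicNack(ea.DeliveryTag, multiple: false, requeue: true);
                }
            };
            _consumerTag = _channel.BasicConsume("ccc-work-queue", autoAck: false, consumer);
        }

        // Hypothetical HTTP call to CCC; throws HttpRequestException on failure.
        private Task CallCccAsync(byte[] body) => throw new NotImplementedException();
    }
    ```

    Note that Polly transitions to half-open lazily, on the next `ExecuteAsync` after the break duration elapses; with the consumer cancelled, you would need some trigger (for example, a timer probing CCC through the breaker) for `onHalfOpen` to actually fire.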

    This pattern would prevent hundreds of thousands of messages from flooding into the RabbitMQ error, dead-letter, or custom retry queue as soon as the circuit-breaker detects CCC is offline.

    You would still need to decide what happens to the messages that do fail (before the circuit breaks, or while it is being retested), as described in another answer: direct them to an error/retry queue. Alternatively, if the unsubscribe-while-CCC-is-down pattern works well enough with your real-world parameters, you may be able to let failed messages simply return to the original queue.
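
    Both options can be expressed with RabbitMQ's dead-lettering and the `requeue` flag on a negative ack. A sketch, assuming an open `IModel channel`; the exchange and routing-key names are illustrative:

    ```csharp
    using System;
    using System.Collections.Generic;
    using RabbitMQ.Client;

    // Declare the work queue with a dead-letter exchange, so that rejected
    // messages are routed to an error/retry queue instead of being lost.
    var arguments = new Dictionary<string, object>
    {
        ["x-dead-letter-exchange"] = "ccc-errors",      // assumed exchange name
        ["x-dead-letter-routing-key"] = "ccc.failed"    // assumed routing key
    };
    channel.QueueDeclare("ccc-work-queue",
        durable: true, exclusive: false, autoDelete: false, arguments: arguments);

    // Option 1: send a failed message to the error/retry queue.
    channel.BasicNack(deliveryTag, multiple: false, requeue: false);

    // Option 2: let a failed message return to the original queue.
    channel.BasicNack(deliveryTag, multiple: false, requeue: true);
    ```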


    If CCC also experiences transient faults (faulting for only a few seconds), consider introducing a WaitAndRetry policy.
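
    A sketch of combining the two with Polly: the retry absorbs brief blips, while repeated failures still trip the circuit-breaker. The retry counts and backoff are illustrative, and `breaker` is the circuit-breaker policy already discussed:

    ```csharp
    using System;
    using System.Net.Http;
    using Polly;

    // Retry up to 3 times with exponential backoff: 2s, 4s, 8s.
    var retry = Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    // Retry outermost, breaker innermost: every individual attempt is seen
    // by the breaker, so a sustained outage still breaks the circuit.
    var resilient = Policy.WrapAsync(retry, breaker);

    // await resilient.ExecuteAsync(() => CallCccAsync(message));
    ```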


    With incoming message rates of hundreds of messages per second, you also want to consider how you limit the parallelism of message processing within BBB, and what timeout you set on calls to CCC. Without these limits you risk memory growth in the consumer as more and more messages arrive while earlier requests are still hanging on a response from CCC; a high timeout on CCC clearly exacerbates this. Consumer parallelism can be limited by using manual acks and applying a pre-fetch count.
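
    Both limits take only a couple of lines with the RabbitMQ .NET client and `HttpClient`; the specific numbers below are illustrative assumptions to tune against your real traffic:

    ```csharp
    using System;
    using System.Net.Http;
    using RabbitMQ.Client;

    // With manual acks, a prefetch count caps in-flight work: RabbitMQ will
    // deliver at most 20 unacknowledged messages to this consumer at a time.
    channel.BasicQos(prefetchSize: 0, prefetchCount: 20, global: false);
    channel.BasicConsume("ccc-work-queue", autoAck: false, consumer);

    // Bound how long any single call to CCC can hang before failing fast.
    var httpClient = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };
    ```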