Tags: multithreading, reactive-programming, message-queue, microservices, event-driven-design

How should event-driven microservices handle the messaging queue being down?


Assume there are two services, A and B, in a microservice environment.

In between A and B sits a messaging queue M that acts as a broker.

A<---->'M'<----->B

The problem is what if the broker M is down?

A possible solution I can think of: service A pings the messaging queue M at regular intervals for as long as it is down. In the meantime, service A stores the data in a local DB and dumps it into the queue once the broker M is back up.

Considering the above problem, I would be grateful if someone could suggest whether threads or reactive programming is better suited for this scenario, and how it could be handled in code.


Solution

  • The problem is what if the broker M is down?

    If the broker is down, then A and B can't use it to communicate.

    What A and B should do in that scenario is going to depend very much on the details of your particular application/use-case.

    Is there useful work they can do in that scenario?

    If not, then they might as well just stop trying to handle any work/transactions for the time being, and instead just sit and wait for M to come back up. Having them do periodic pings/queries of M (to see if it's back yet) while in this state is a good idea.
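
    For example, the "wait and ping" part could look something like the sketch below (plain Java; the BrokerClient interface and its isReachable() method are hypothetical stand-ins for whatever health-check call your actual broker library provides). It backs off exponentially so that a long outage doesn't turn into a flood of pings:

        // Hypothetical broker health-check client -- substitute whatever your
        // broker library actually exposes (e.g. an admin/health endpoint).
        interface BrokerClient {
            boolean isReachable();   // true if M answers a lightweight health probe
        }

        final class BrokerWatcher {
            /** Blocks until the broker answers a ping, backing off exponentially. */
            static void waitUntilBrokerIsUp(BrokerClient broker) throws InterruptedException {
                long delayMs = 1_000;                        // start by waiting 1 second
                while (!broker.isReachable()) {
                    Thread.sleep(delayMs);                   // sit idle, then probe again
                    delayMs = Math.min(delayMs * 2, 60_000); // cap the back-off at 1 minute
                }
            }
        }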

    If they can do something useful in this scenario, then you can have them continue to work in some sort of "offline mode", caching their results locally in anticipation of M's re-appearance at some point in the future (a rough sketch of this kind of offline buffering follows the list below). Of course this can become problematic, especially if M doesn't come back up for a long time -- e.g.

    • what if the set of cached local results becomes unreasonably large, such that A/B runs out of space to store it?
    • Or what if A and B cache local results that will both apply to the same data structure(s) within M, such that when M comes back online, some of A's results will overwrite B's (or vice-versa, depending on the order in which they reconnect)? (This is analogous to the sort of thing that source-code-control servers have to deal with after several developers have been working offline, both making changes to the same lines in the same file, and then they both come back online and want to commit their changes to that file. It can get a bit complex and there's not always an obvious "correct" way to resolve conflicts)
    • Finally what if it was something A or B "said" that caused M to crash in the first place? In that case, re-uploading the same requests to M after it comes back online might only cause it to crash again, and so on in an infinite loop, making the service perpetually unusable. (In this case, of course, the proper fix would be to debug M)
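
    Below is a minimal sketch of that offline-buffering idea, assuming you have some durable local store and a Broker client with isUp()/publish() methods (both hypothetical -- adapt them to your actual broker library and local DB). It deliberately ignores the size-limit and conflict-resolution issues listed above:

        import java.util.ArrayDeque;
        import java.util.Queue;

        // Hypothetical broker client -- adapt to whatever your broker library provides.
        interface Broker {
            boolean isUp();
            void publish(String message);   // throws RuntimeException if M is unreachable
        }

        // In-memory stand-in for the local cache; a real service would use a file or
        // embedded DB so that buffered messages survive a restart of A or B.
        final class OfflineBuffer {
            private final Queue<String> pending = new ArrayDeque<>();
            private final Broker broker;

            OfflineBuffer(Broker broker) { this.broker = broker; }

            /** Publish immediately if M is up, otherwise cache the message locally. */
            void send(String message) {
                if (broker.isUp()) {
                    flush();                  // drain anything buffered first, in order
                    broker.publish(message);
                } else {
                    pending.add(message);     // cache for later; beware unbounded growth
                }
            }

            /** Replay cached messages once M is reachable again. */
            void flush() {
                while (!pending.isEmpty() && broker.isUp()) {
                    broker.publish(pending.poll());
                }
            }
        }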

    Another approach might be to try to avoid the problem by having multiple redundant brokers (e.g. M1, M2, M3, ...) such that as long as at least one of them is still available, productive work can continue. Or perhaps allow A and B to communicate with each other directly rather than through an intermediary.
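
    With redundant brokers, the publishing side can be as simple as trying each broker in turn until one of them accepts the message (reusing a trimmed-down version of the hypothetical Broker interface from the previous sketch):

        import java.util.List;

        // Trimmed-down version of the hypothetical Broker interface used above.
        interface Broker {
            void publish(String message);   // throws RuntimeException if the broker is unreachable
        }

        final class RedundantPublisher {
            /** Tries M1, M2, M3, ... in order; returns true if any broker accepted the message. */
            static boolean publishToAnyBroker(List<Broker> brokers, String message) {
                for (Broker broker : brokers) {
                    try {
                        broker.publish(message);   // first reachable broker wins
                        return true;
                    } catch (RuntimeException unreachable) {
                        // this broker is down -- fall through and try the next one
                    }
                }
                return false;                      // all brokers down: caller must buffer or fail
            }
        }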

    As for whether this sort of thing would best be handled by threads or reactive programming, that's largely a matter of personal preference -- personally I prefer reactive programming, because the multiple-threads style usually means blocking RPC operations, and a thread that is blocked inside a blocking operation is frozen/helpless until the remote party responds (e.g. if M takes 2 minutes to respond to an RPC request, then A's RPC call to M cannot return for 2 minutes, which means the calling thread can do nothing at all during that time). In a reactive approach, A's thread is free to do other things during that 2-minute period (such as pinging M to make sure it's okay, or contacting a backup broker, or whatever).
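
    To make that difference concrete, here is a rough sketch using plain Java and CompletableFuture (slowRpcToM() is a hypothetical RPC that may take up to 2 minutes). The blocking version pins the calling thread for the full duration, while the asynchronous version registers a callback and is immediately free to ping M, contact a backup broker, and so on:

        import java.util.concurrent.CompletableFuture;

        final class BlockingVsReactive {
            // Hypothetical RPC that may take up to 2 minutes before M answers.
            static String slowRpcToM(String request) {
                try {
                    Thread.sleep(120_000);                     // simulate a very slow broker
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return "reply to " + request;
            }

            static void blockingStyle() {
                String reply = slowRpcToM("hello");            // this thread is frozen until M responds
                System.out.println(reply);
            }

            static void reactiveStyle() {
                CompletableFuture
                    .supplyAsync(() -> slowRpcToM("hello"))    // RPC runs on a pool thread
                    .thenAccept(System.out::println);          // callback fires whenever M answers
                // ...meanwhile this thread is free to ping M, contact a backup broker, etc.
            }
        }

    Note that this sketch only frees the calling thread -- the RPC still ties up a pool thread -- whereas a fully reactive broker client would avoid blocking any thread at all while waiting for M.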