Restart in Shared Store Filesystem for an ActiveMQ Artemis Cluster

We occasionally need to perform maintenance on the NFSv4 mount point used as the shared store for ActiveMQ Artemis. Is there a safe way to do this without stopping the Artemis cluster? Taking the shared store offline briefly (about 20 seconds) and then making it available again produces inconsistent results: sometimes the brokers recover, sometimes they do not.

The failure is related to file locks and shows up in the server logs as:

AMQ221034: Waiting indefinitely to obtain live lock
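
To check how often a broker ends up in this state after the share comes back, the log can be scanned for that message code. The following is only a minimal sketch: the helper name `find_lock_waits` is hypothetical, and nothing is assumed about the log line format beyond the `AMQ221034` code quoted above.

```python
# Message code taken from the log line quoted above; the surrounding
# log format (timestamps, levels, logger names) is deliberately not assumed.
LOCK_WAIT_CODE = "AMQ221034"


def find_lock_waits(log_lines):
    """Return (line_number, text) pairs for lines that mention the lock-wait code.

    `log_lines` can be any iterable of strings, for example an open file.
    Line numbers are 1-based.
    """
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if LOCK_WAIT_CODE in line
    ]
```

It could be called on an open log file, for example `find_lock_waits(open(path))`, where `path` points at wherever the broker instance writes its log.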

My guess is that if messages are in flight while the shared store is under maintenance, Artemis is unable to obtain the file lock again afterwards.

If that is true, would pausing all queues work (or anything else that prevents Artemis from writing to the journal)?

Solution

Unless there is some way to perform your maintenance on the NFSv4 share that is 100% transparent to clients like ActiveMQ Artemis, you really need to shut the brokers down to ensure reliable behavior.

The broker reads from or writes to the disk for all sorts of operations, including anything to do with durable messages, paging for an address, large messages, etc. The active broker also polls the shared store repeatedly to ensure it still holds the lock. Trying to "pause" all of these activities administratively simply isn't possible. Furthermore, doing so would essentially mean an outage for all the messaging clients anyway. The simplest solution is to stop the brokers, perform your maintenance, and restart them. If this downtime is not acceptable, then I recommend investigating alternatives to your current NFS setup.
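
The stop, maintain, restart sequence could be scripted along the following lines. This is a sketch under stated assumptions, not a definitive procedure: it assumes each broker instance directory contains the `bin/artemis-service` script with `stop` and `start` actions, and it assumes the backup should be stopped first and started last so that it does not try to take over while the primary goes down. The function names and the instance paths are hypothetical.

```python
import subprocess


def service_cmd(instance_dir, action):
    # Assumption: the broker instance provides bin/artemis-service,
    # which accepts "stop" and "start" as its action argument.
    return [f"{instance_dir}/bin/artemis-service", action]


def restart_around_maintenance(instances, maintenance, run=subprocess.run):
    """Stop every broker, run the maintenance step, then start the brokers.

    `instances` is given in stop order (assumed: backup first, primary last);
    brokers are started in the reverse order. `maintenance` is any callable.
    If stopping a broker or the maintenance step raises, no broker is started,
    so nothing comes up against a share that may still be unavailable.
    """
    for instance in instances:
        run(service_cmd(instance, "stop"), check=True)

    maintenance()

    for instance in reversed(instances):
        run(service_cmd(instance, "start"), check=True)
```

A call might look like `restart_around_maintenance(["/opt/brokers/backup", "/opt/brokers/primary"], do_nfs_maintenance)`, where both paths and `do_nfs_maintenance` are placeholders for your own environment. The `run` parameter exists only so the sequence can be exercised without touching real brokers.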