Diagnosing High Availability -- ActiveMQ Artemis

Is there a way to diagnose HA issues in ActiveMQ Artemis? I have a pair of shared-store servers that work really well. When I shut down the primary, the secondary takes over until the primary tells it that it's back up; then the primary takes over again and the secondary goes back to being a secondary.

I took the configuration and basically copied it to another pair of servers, but this one isn't working.

Everything looks fine, as far as I can tell. The cluster appears in the console, and the two servers connect. When I shut down the primary, the secondary logs this message:

2020-12-06 16:59:26,379 WARN  [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to <Primary IP>/<Primary IP>:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]

In the working pair, right after this message the secondary quickly deploys all my addresses and queues and takes over. But in the new pair, the secondary does nothing after this.

I'm not sure where to start looking. I just keep comparing the configuration of the non-working pair with the working pair.

I'm using an NFS mount. The shared storage behind it is Azure's NetApp.

Here are my broker configurations. They should be correct, though, because the same configuration works on the other pair.

Primary:

<connectors>
   <connector name="artemis">tcp://<primary URL>:61616</connector>
   <connector name="artemis-backup">tcp://<secondary URL>:61616</connector>
</connectors>

<cluster-user>activemq</cluster-user>
<cluster-password>artemis123</cluster-password>

<ha-policy>
   <shared-store>
      <master>
         <failover-on-shutdown>true</failover-on-shutdown>
      </master>
   </shared-store>
</ha-policy>

<cluster-connections>
   <cluster-connection name="cluster-1">
      <connector-ref>artemis</connector-ref>
      <static-connectors>
         <connector-ref>artemis-backup</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>

Secondary:

<connectors>
   <connector name="artemis-live">tcp://<primary URL>:61616</connector>
   <connector name="artemis">tcp://<secondary URL>:61616</connector>
</connectors>

<cluster-user>activemq</cluster-user>
<cluster-password>artemis123</cluster-password>

<ha-policy>
   <shared-store>
      <slave>
         <allow-failback>true</allow-failback>
         <failover-on-shutdown>true</failover-on-shutdown>
      </slave>
   </shared-store>
</ha-policy>

<cluster-connections>
   <cluster-connection name="cluster-1">
      <connector-ref>artemis</connector-ref>
      <static-connectors>
         <connector-ref>artemis-live</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>

Solution

In the shared-store configuration, the backup broker continuously attempts to acquire a file lock on the journal. Since the master broker already holds the lock, the backup cannot acquire it until the master dies or shuts down. Therefore, I would look at the shared storage and ensure that file locking is working properly.
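One way to check locking on the share independently of the broker is a small probe like the one below. This is only a sketch: the class name `LockProbe` and the scratch file name are made up for illustration, and it is not the broker's own locking code. It just uses the standard `java.nio` file-lock API against a throwaway file, so point it at a scratch file on the NFS mount rather than at anything in the broker's data directory.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class LockProbe {

    /** Returns true if an exclusive lock on the file could be taken; the lock is released again. */
    public static boolean canLock(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock;
            try {
                lock = channel.tryLock();   // non-blocking; null if another process holds it
            } catch (OverlappingFileLockException e) {
                return false;               // already locked inside this same JVM
            }
            if (lock == null) {
                return false;
            }
            lock.release();
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical defaults: pass the scratch file path and an optional hold time in seconds.
        Path probe = Paths.get(args.length > 0 ? args[0] : "lock-probe.tmp");
        int holdSeconds = args.length > 1 ? Integer.parseInt(args[1]) : 0;

        if (holdSeconds == 0) {
            System.out.println(probe + ": "
                    + (canLock(probe) ? "lock acquired" : "lock is held elsewhere"));
            return;
        }

        // Hold mode: keep the lock for a while so a second host can test against it.
        try (FileChannel channel = FileChannel.open(probe,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.tryLock()) {
            if (lock == null) {
                System.out.println(probe + ": lock is held elsewhere");
                return;
            }
            System.out.println(probe + ": holding lock for " + holdSeconds + "s");
            Thread.sleep(holdSeconds * 1000L);
        }
    }
}
```

Run it in hold mode on the primary host (for example with a hold time of 120), then run it without a hold time on the secondary host against the same file. While the first process is alive the second should report that the lock is held elsewhere, and shortly after the first process is killed it should report that the lock was acquired. If the second host gets the lock while the first still holds it, or never gets it after the first exits, the share's locking is the likely culprit.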

Since you're using NFS, the NFS client mount options are worth inspecting as well. Here are the options I would recommend to enable reasonable fail-over times:

timeo=50 - NFS timeout of 5 seconds (the value is given in tenths of a second)
retrans=1 - allows only one retry
soft - soft mounting the NFS share disables the retry-forever logic, allowing NFS errors to surface in the application stack after the above timeouts
noac - turns off caching of file attributes and also forces synchronous writes to the NFS share. This also reduces the time it takes for NFS errors to surface.
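Since one pair works and the other doesn't, it can help to compare the mount options on both pairs mechanically rather than by eye. The helper below is a hypothetical sketch (the class and method names are my own): it takes a comma-separated option string, such as the options column of the share's entry in /etc/fstab, and lists which of the four options above are absent. It only does a literal string comparison, so a different value such as timeo=600 is reported as missing timeo=50.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NfsOptionCheck {

    // The four options recommended in the answer above.
    static final List<String> RECOMMENDED =
            Arrays.asList("timeo=50", "retrans=1", "soft", "noac");

    /** Returns the recommended options that do not appear in the given option string. */
    public static List<String> missing(String mountOptions) {
        Set<String> present = new HashSet<>();
        for (String option : mountOptions.split(",")) {
            present.add(option.trim());
        }
        List<String> missing = new ArrayList<>();
        for (String option : RECOMMENDED) {
            if (!present.contains(option)) {
                missing.add(option);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative default only; pass the real option string as the first argument.
        String options = args.length > 0 ? args[0] : "rw,hard,timeo=600,retrans=2";
        System.out.println("Missing recommended options: " + missing(options));
    }
}
```

Running it with the option string from each of the four hosts shows quickly whether the non-working pair was mounted differently from the working one.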