Tags: high-availability, activemq-artemis

High CPU usage on idle AMQ Artemis cluster, related to locks with shared-store HA


I have an AMQ Artemis cluster with shared-store HA (master/slave), version 2.17.0.
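For context, the HA part of my broker.xml follows the standard shared-store layout, roughly like this (a simplified sketch, not my verbatim config; the failover-on-shutdown/allow-failback values here are illustrative):

<!-- master broker.xml -->
<ha-policy>
    <shared-store>
        <master>
            <failover-on-shutdown>true</failover-on-shutdown>
        </master>
    </shared-store>
</ha-policy>

<!-- slave broker.xml -->
<ha-policy>
    <shared-store>
        <slave>
            <allow-failback>true</allow-failback>
        </slave>
    </shared-store>
</ha-policy>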

I noticed that all my clusters (active servers only) that are idle (no one is using them) are using between 10% and 20% CPU, except one, which is using around 1% (which is totally normal). I started investigating...

Long story short: only one cluster has completely normal CPU usage. The only difference I've managed to find is that if I connect to that normal cluster's master node and attempt telnet slave 61616, it shows as connected. If I do the same in any other cluster (the ones with high CPU usage), the connection is refused.

In order to better understand what is happening, I enabled DEBUG logs in instance/etc/logging.properties.
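The change was roughly the following (a sketch against the stock 2.17 logging.properties; each DEBUG category also has to appear on the file's existing loggers= line):

logger.org.apache.activemq.artemis.core.server.level=DEBUG
logger.org.apache.activemq.artemis.core.client.level=DEBUG
logger.org.apache.activemq.artemis.core.remoting.level=DEBUG

Here is what the master node is spamming: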

2021-05-07 13:54:31,857 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Backup is not active, trying original connection configuration now.
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Trying reconnection attempt 0/1
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Trying to connect with connectorFactory = org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnectorFactory@6cf71172, connectorConfig=TransportConfiguration(name=slave-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?trustStorePassword=****&port=61616&keyStorePassword=****&sslEnabled=true&host=slave-com&trustStorePath=/path/to/ssl/truststore-jks&keyStorePath=/path/to/ssl/keystore-jks
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector] Connector NettyConnector [host=slave.com, port=61616, httpEnabled=false, httpUpgradeEnabled=false, useServlet=false, servletPath=/messaging/ActiveMQServlet, sslEnabled=true, useNio=true] using native epoll
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.client] AMQ211002: Started EPOLL Netty Connector version 4.1.51.Final to slave.com:61616
2021-05-07 13:54:32,358 DEBUG [org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector] Remote destination: slave.com/123.123.123.123:61616
2021-05-07 13:54:32,358 DEBUG [org.apache.activemq.artemis.spi.core.remoting.ssl.SSLContextFactory] Creating SSL context with configuration
trustStorePassword=****
port=61616
keyStorePassword=****
sslEnabled=true
host=slave.com
trustStorePath=/path/to/ssl/truststore.jks
keyStorePath=/path/to/ssl/keystore.jks
2021-05-07 13:54:32,448 DEBUG [org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector] Added ActiveMQClientChannelHandler to Channel with id = 77c078c2
2021-05-07 13:54:32,448 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Connector towards NettyConnector [host=slave.com, port=61616, httpEnabled=false, httpUpgradeEnabled=false, useServlet=false, servletPath=/messaging/ActiveMQServlet, sslEnabled=true, useNio=true] failed
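The slave-connector that the master keeps retrying maps to a connector definition along these lines in my broker.xml (a sketch: the URL parameters mirror what the log prints; hostnames and paths are placeholders):

<connectors>
    <connector name="master-connector">tcp://master.com:61616?sslEnabled=true;trustStorePath=/path/to/ssl/truststore.jks;trustStorePassword=****;keyStorePath=/path/to/ssl/keystore.jks;keyStorePassword=****</connector>
    <connector name="slave-connector">tcp://slave.com:61616?sslEnabled=true;trustStorePath=/path/to/ssl/truststore.jks;trustStorePassword=****;keyStorePath=/path/to/ssl/keystore.jks;keyStorePassword=****</connector>
</connectors>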

This is what the slave node is spamming (the backup's FileLockNodeManager repeatedly trying, and failing, to acquire the live lock in server.lock):

2021-05-07 14:06:53,177 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] trying to lock position: 1
2021-05-07 14:06:53,178 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] failed to lock position: 1

If I attempt to telnet from the master node to the slave node (the same happens if I do it from the slave to itself):

[root@master]# telnet slave.com 61616
Trying 123.123.123.123...
telnet: connect to address 123.123.123.123: Connection refused

However, if I attempt the same telnet in the only working cluster, I can successfully "connect" from master to slave...
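As far as I understand, a shared-store backup does not activate its acceptors until it acquires the live lock and takes over, so a refused connection on the slave's port is arguably expected while the master is live. Both of my nodes declare essentially the same acceptor (a simplified sketch with placeholder paths):

<acceptors>
    <acceptor name="artemis">tcp://0.0.0.0:61616?sslEnabled=true;keyStorePath=/path/to/ssl/keystore.jks;keyStorePassword=****;trustStorePath=/path/to/ssl/truststore.jks;trustStorePassword=****</acceptor>
</acceptors>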


Here is what I suspect:

  1. The master acquires the lock on instance/data/journal/server.lock.
  2. The master keeps trying to connect to the slave server.
  3. The slave is unable to fully start because it cannot acquire the same server.lock on the shared storage.
  4. The master burns CPU because of this hard retrying against a slave that is not accepting connections.

What am I doing wrong?

EDIT: This is what my NFS mounts look like (taken from the mount command):

some_server:/some_dir on /path/to/artemis/instance/data type nfs4 (rw,relatime,sync,vers=4.1,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=123.123.123.123,local_lock=none,addr=123.123.123.123)

Solution

  • Turns out the issue was in the broker.xml configuration. In static-connectors I had, for some reason, listed only the "non-current" server (e.g. I have srv0 and srv1; in srv0 I had only added the connector of srv1, and vice versa).

    What it used to be (on the 1st master node):

    <cluster-connections>
        <cluster-connection name="abc">
            <connector-ref>srv0-connector</connector-ref>
            <message-load-balancing>ON_DEMAND</message-load-balancing>
            <max-hops>1</max-hops>
            <static-connectors>
               <connector-ref>srv1-connector</connector-ref>
            </static-connectors>
        </cluster-connection>
    </cluster-connections>
    

    How it is now (on the 1st master node):

    <cluster-connections>
        <cluster-connection name="abc">
            <connector-ref>srv0-connector</connector-ref>
            <message-load-balancing>ON_DEMAND</message-load-balancing>
            <max-hops>1</max-hops>
            <static-connectors>
               <connector-ref>srv0-connector</connector-ref>
               <connector-ref>srv1-connector</connector-ref>
            </static-connectors>
        </cluster-connection>
    </cluster-connections>
    

    After listing all of the cluster's nodes, the CPU usage normalized and it is now only ~1% on the active node. The issue turned out not to be about the AMQ Artemis connection retries or the file locks at all; those were just symptoms.
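    For the record, the change on the 2nd master node is symmetrical (a sketch using the same placeholder names as above):

    <cluster-connections>
        <cluster-connection name="abc">
            <connector-ref>srv1-connector</connector-ref>
            <message-load-balancing>ON_DEMAND</message-load-balancing>
            <max-hops>1</max-hops>
            <static-connectors>
               <connector-ref>srv0-connector</connector-ref>
               <connector-ref>srv1-connector</connector-ref>
            </static-connectors>
        </cluster-connection>
    </cluster-connections>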