Latency issue reported for the backup server


ActiveMQ Artemis primary broker logs:

2024-05-25 02:00:52,946 WARN  [org.apache.activemq.artemis.core.server] AMQ222207: The backup server is not responding promptly introducing latency beyond the limit. Replication server being disconnected now.

ActiveMQ Artemis backup broker logs:

2024-05-25 02:01:00,230 INFO  [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: PrimaryFailoverQuorumVote
2024-05-25 02:01:00,230 INFO  [org.apache.activemq.artemis.core.server] AMQ221084: Requested 0 quorum votes
2024-05-25 02:01:00,231 INFO  [org.apache.activemq.artemis.core.server] AMQ221083: ignoring quorum vote as max cluster size is 1.
2024-05-25 02:01:00,231 INFO  [org.apache.activemq.artemis.core.server] AMQ221071: Failing over based on quorum vote results.
2024-05-25 02:01:00,242 INFO  [org.apache.activemq.artemis.core.server] AMQ221037: ActiveMQServerImpl::name=amq2 to become 'active'
2024-05-25 02:01:02,167 INFO  [org.apache.activemq.artemis.core.server] AMQ221080: Deploying address exampleQueue supporting [ANYCAST]
2024-05-25 02:01:02,167 INFO  [org.apache.activemq.artemis.core.server] AMQ221003: Deploying ANYCAST queue exampleQueue on address exampleQueue
2024-05-25 02:01:02,167 INFO  [org.apache.activemq.artemis.core.server] AMQ221080: Deploying address DLQ supporting []
2024-05-25 02:01:02,167 WARN  [org.apache.activemq.artemis.core.server] AMQ222274: Failed to deploy address DLQ: AMQ229209: Can't remove routing type MULTICAST, queues exists for address: DLQ. Please delete queues before removing this routing type.
2024-05-25 02:01:02,167 INFO  [org.apache.activemq.artemis.core.server] AMQ221080: Deploying address ExpiryQueue supporting [ANYCAST]
2024-05-25 02:01:02,167 INFO  [org.apache.activemq.artemis.core.server] AMQ221003: Deploying ANYCAST queue ExpiryQueue on address ExpiryQueue
2024-05-25 02:01:02,183 WARN  [org.apache.activemq.artemis.core.client] AMQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its primary node. nodeID=3468b7ea-0e82-11ef-922f-0050568e2645
2024-05-25 02:01:02,199 INFO  [org.apache.activemq.artemis.core.server] AMQ221007: Server is now active
2024-05-25 02:01:02,199 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started NIO Acceptor at 0.0.0.0:61616 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
2024-05-25 02:01:02,214 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started NIO Acceptor at 0.0.0.0:5445 for protocols [HORNETQ,STOMP]
2024-05-25 02:01:02,214 WARN  [org.apache.activemq.artemis.core.client] AMQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its primary node.nodeID=3468b7ea-0e82-11ef-922f-0050568e2645
2024-05-25 02:01:02,214 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started NIO Acceptor at 0.0.0.0:5672 for protocols [CORE,AMQP]
2024-05-25 02:01:02,230 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started NIO Acceptor at 0.0.0.0:1883 for protocols [MQTT,CORE]
2024-05-25 02:01:02,246 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started NIO Acceptor at 0.0.0.0:61613 for protocols [CORE,STOMP]
2024-05-25 02:01:04,197 WARN  [org.apache.activemq.artemis.core.client] AMQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its primary node. nodeID=3468b7ea-0e82-11ef-922f-0050568e2645

The primary server logs the latency warning shown above about the backup, but the backup server itself shows no problems. When the backup server detects a problem with the primary server, it becomes active. However, the primary server also remains active, so both servers are active at the same time. As a result, the broker does not respond to requests properly.

Primary broker.xml:

<?xml version='1.0'?>
<configuration xmlns="urn:activemq"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:xi="http://www.w3.org/2001/XInclude"
               xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">

   <core xmlns="urn:activemq:core" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="urn:activemq:core ">

      <name>amq1</name>
      <persistence-enabled>true</persistence-enabled>
      <journal-type>ASYNCIO</journal-type>
      <paging-directory>data/paging</paging-directory>
      <bindings-directory>data/bindings</bindings-directory>
      <journal-directory>data/journal</journal-directory>
      <large-messages-directory>data/large-messages</large-messages-directory>
      <journal-datasync>true</journal-datasync>
      <journal-min-files>2</journal-min-files>
      <journal-pool-files>10</journal-pool-files>
      <journal-device-block-size>4096</journal-device-block-size>
      <journal-file-size>100M</journal-file-size>
      <journal-buffer-timeout>28000</journal-buffer-timeout>
      <journal-max-io>8192</journal-max-io>
      <disk-scan-period>5000</disk-scan-period>
      <max-disk-usage>100</max-disk-usage>
      <critical-analyzer>true</critical-analyzer>
      <critical-analyzer-timeout>150000</critical-analyzer-timeout>
      <critical-analyzer-check-period>60000</critical-analyzer-check-period>
      <critical-analyzer-policy>HALT</critical-analyzer-policy>
      <page-sync-timeout>1628000</page-sync-timeout>
      <global-max-size>2GB</global-max-size>

      <connectors>
        <connector name="amq1">tcp://pro-t24-qve-01:61616</connector>
        <connector name="amq2">tcp://pro-t24-qve-02:61616</connector>
      </connectors>
       <acceptors>
           <acceptor name="artemis">tcp://0.0.0.0:61616?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;amqpMinLargeMessageSize=102400;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true</acceptor>
           <acceptor name="amqp">tcp://0.0.0.0:5672?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=AMQP,CORE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpMinLargeMessageSize=102400;amqpDuplicateDetection=true</acceptor>
       </acceptors>
       <broadcast-groups>
           <broadcast-group name="artemis-broadcast-group">
               <group-address>231.7.7.7</group-address>
               <group-port>9876</group-port>
               <broadcast-period>2000</broadcast-period>
               <connector-ref>amq1</connector-ref>
           </broadcast-group>
       </broadcast-groups>
       <discovery-groups>
           <discovery-group name="artemis-discovery-group">
               <group-address>231.7.7.7</group-address>
               <group-port>9876</group-port>
               <refresh-timeout>10000</refresh-timeout>
           </discovery-group>
       </discovery-groups>
      <cluster-user>admin</cluster-user>
      <cluster-password>admin</cluster-password>
      <cluster-connections>
         <cluster-connection name="artemis-cluster">
            <connector-ref>amq1</connector-ref>
            <retry-interval>1000</retry-interval>
            <retry-interval-multiplier>3</retry-interval-multiplier>
            <max-retry-interval>5000</max-retry-interval>
            <initial-connect-attempts>-1</initial-connect-attempts>
            <reconnect-attempts>-1</reconnect-attempts>
            <use-duplicate-detection>true</use-duplicate-detection>
            <message-load-balancing>STRICT</message-load-balancing>
            <max-hops>1</max-hops>
             <discovery-group-ref discovery-group-name="artemis-discovery-group"/>
         </cluster-connection>
      </cluster-connections>
      
      <!-- Other config -->
      <ha-policy>
        <replication>
          <primary>
            <group-name>artemis-discovery-group</group-name>
            <check-for-active-server>true</check-for-active-server>
          </primary>
        </replication>
      </ha-policy>

      <security-settings>
         <security-setting match="#">
            <permission type="createNonDurableQueue" roles="amq"/>
            <permission type="deleteNonDurableQueue" roles="amq"/>
            <permission type="createDurableQueue" roles="amq"/>
            <permission type="deleteDurableQueue" roles="amq"/>
            <permission type="createAddress" roles="amq"/>
            <permission type="deleteAddress" roles="amq"/>
            <permission type="consume" roles="amq"/>
            <permission type="browse" roles="amq"/>
            <permission type="send" roles="amq"/>
            <!-- we need this otherwise ./artemis data imp wouldn't work -->
            <permission type="manage" roles="amq"/>
         </security-setting>
      </security-settings>
      <addresses>
         <address name="exampleQueue">
            <anycast>
               <queue name="exampleQueue"/>
            </anycast>
         </address>
         <address name="DLQ">
         </address>
         <address name="ExpiryQueue">
            <anycast>
               <queue name="ExpiryQueue" />
            </anycast>
         </address>
      </addresses>
      <address-settings>
         <!-- if you define auto-create on certain queues, management has to be auto-create -->
         <address-setting match="activemq.management#">
            <dead-letter-address>DLQ</dead-letter-address>
            <expiry-address>ExpiryQueue</expiry-address>
            <redelivery-delay>1000</redelivery-delay>
            <max-delivery-attempts>3</max-delivery-attempts>
            <!-- with -1 only the global-max-size is in use for limiting -->
            <max-size-bytes>-1</max-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
            <auto-create-queues>true</auto-create-queues>
            <auto-create-addresses>true</auto-create-addresses>
            <auto-create-jms-queues>true</auto-create-jms-queues>
            <auto-create-jms-topics>true</auto-create-jms-topics>
            <auto-delete-queues>false</auto-delete-queues>   
            <auto-delete-addresses>false</auto-delete-addresses> 
            <auto-delete-jms-queues>false</auto-delete-jms-queues> 
            <auto-delete-jms-topics>false</auto-delete-jms-topics>
         </address-setting>
         <!--default for catch all-->
         <address-setting match="#">
            <dead-letter-address>DLQ</dead-letter-address>
            <expiry-address>ExpiryQueue</expiry-address>
            <redelivery-delay>1000</redelivery-delay>
            <max-delivery-attempts>3</max-delivery-attempts>
            <auto-create-dead-letter-resources>true</auto-create-dead-letter-resources>
            <!-- with -1 only the global-max-size is in use for limiting -->
            <max-size-bytes>-1</max-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
            <auto-create-queues>true</auto-create-queues>
            <auto-create-addresses>true</auto-create-addresses>
            <auto-create-jms-queues>true</auto-create-jms-queues>
            <auto-create-jms-topics>true</auto-create-jms-topics>
            <auto-delete-queues>false</auto-delete-queues>   
            <auto-delete-addresses>false</auto-delete-addresses> 
            <auto-delete-jms-queues>false</auto-delete-jms-queues> 
            <auto-delete-jms-topics>false</auto-delete-jms-topics>  
         </address-setting>
         <address-setting match="exampleQueue">            
            <dead-letter-address>DLQ</dead-letter-address>                      
            <redelivery-delay>1000</redelivery-delay>    
            <max-delivery-attempts>3</max-delivery-attempts>
            <max-size-bytes>-1</max-size-bytes>
            <page-size-bytes>1048576</page-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
         </address-setting>
      </address-settings>
   </core>
</configuration>

Backup broker.xml:

<?xml version='1.0'?>
<configuration xmlns="urn:activemq"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:xi="http://www.w3.org/2001/XInclude"
               xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">

   <core xmlns="urn:activemq:core" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="urn:activemq:core ">

      <name>amq2</name>
      <persistence-enabled>true</persistence-enabled>
      <journal-type>ASYNCIO</journal-type>
      <paging-directory>data/paging</paging-directory>
      <bindings-directory>data/bindings</bindings-directory>
      <journal-directory>data/journal</journal-directory>
      <large-messages-directory>data/large-messages</large-messages-directory>
      <journal-datasync>true</journal-datasync>
      <journal-min-files>2</journal-min-files>
      <journal-pool-files>10</journal-pool-files>
      <journal-device-block-size>4096</journal-device-block-size>
      <journal-file-size>100M</journal-file-size>
      <journal-buffer-timeout>28000</journal-buffer-timeout>
      <journal-max-io>8192</journal-max-io>
      <disk-scan-period>5000</disk-scan-period>
      <max-disk-usage>100</max-disk-usage>
      <critical-analyzer>true</critical-analyzer>
      <critical-analyzer-timeout>150000</critical-analyzer-timeout>
      <critical-analyzer-check-period>60000</critical-analyzer-check-period>
      <critical-analyzer-policy>HALT</critical-analyzer-policy>
      <page-sync-timeout>1628000</page-sync-timeout>
      <global-max-size>2GB</global-max-size>

      <connectors>
         <connector name="amq1">tcp://pro-t24-qve-01:61616</connector>
         <connector name="amq2">tcp://pro-t24-qve-02:61616</connector>
      </connectors>
      <acceptors>
         <acceptor name="artemis">tcp://0.0.0.0:61616?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;amqpMinLargeMessageSize=102400;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true</acceptor>
         <acceptor name="amqp">tcp://0.0.0.0:5672?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=AMQP,CORE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpMinLargeMessageSize=102400;amqpDuplicateDetection=true</acceptor>
         <acceptor name="stomp">tcp://0.0.0.0:61613?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=STOMP,CORE;useEpoll=true</acceptor>
         <acceptor name="hornetq">tcp://0.0.0.0:5445?anycastPrefix=jms.queue.;multicastPrefix=jms.topic.;protocols=HORNETQ,STOMP;useEpoll=true</acceptor>
         <acceptor name="mqtt">tcp://0.0.0.0:1883?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=MQTT,CORE;useEpoll=true</acceptor>
      </acceptors>
      <broadcast-groups>
         <broadcast-group name="artemis-broadcast-group">
            <group-address>231.7.7.7</group-address>
            <group-port>9876</group-port>
            <broadcast-period>2000</broadcast-period>
            <connector-ref>amq2</connector-ref>
         </broadcast-group>
      </broadcast-groups>
      <discovery-groups>
         <discovery-group name="artemis-discovery-group">
            <group-address>231.7.7.7</group-address>
            <group-port>9876</group-port>
            <refresh-timeout>10000</refresh-timeout>
         </discovery-group>
      </discovery-groups>
      <cluster-user>admin</cluster-user>
      <cluster-password>admin</cluster-password>
      <cluster-connections>
         <cluster-connection name="artemis-cluster">
            <connector-ref>amq2</connector-ref>
            <retry-interval>1000</retry-interval>
            <retry-interval-multiplier>3</retry-interval-multiplier>
            <max-retry-interval>5000</max-retry-interval>
            <initial-connect-attempts>-1</initial-connect-attempts>
            <reconnect-attempts>-1</reconnect-attempts>
            <use-duplicate-detection>true</use-duplicate-detection>
            <message-load-balancing>STRICT</message-load-balancing>
            <max-hops>1</max-hops>
             <discovery-group-ref discovery-group-name="artemis-discovery-group"/>
         </cluster-connection>
      </cluster-connections>
      
      <!-- Other config -->
      <ha-policy>
         <replication>
            <backup>
               <group-name>artemis-discovery-group</group-name>
               <allow-failback>true</allow-failback>
               <failback-delay>5000</failback-delay>
            </backup>
         </replication>
      </ha-policy>

      <security-settings>
         <security-setting match="#">
            <permission type="createNonDurableQueue" roles="amq"/>
            <permission type="deleteNonDurableQueue" roles="amq"/>
            <permission type="createDurableQueue" roles="amq"/>
            <permission type="deleteDurableQueue" roles="amq"/>
            <permission type="createAddress" roles="amq"/>
            <permission type="deleteAddress" roles="amq"/>
            <permission type="consume" roles="amq"/>
            <permission type="browse" roles="amq"/>
            <permission type="send" roles="amq"/>
            <!-- we need this otherwise ./artemis data imp wouldn't work -->
            <permission type="manage" roles="amq"/>
         </security-setting>
      </security-settings>
      <addresses>
         <address name="exampleQueue">
            <anycast>
               <queue name="exampleQueue"/>
            </anycast>
         </address>
         <address name="DLQ">
         </address>
         <address name="ExpiryQueue">
            <anycast>
               <queue name="ExpiryQueue" />
            </anycast>
         </address>
      </addresses>
      <address-settings>
         <!-- if you define auto-create on certain queues, management has to be auto-create -->
         <address-setting match="activemq.management#">
            <dead-letter-address>DLQ</dead-letter-address>
            <expiry-address>ExpiryQueue</expiry-address>
            <redelivery-delay>1000</redelivery-delay>
            <max-delivery-attempts>3</max-delivery-attempts>
            <!-- with -1 only the global-max-size is in use for limiting -->
            <max-size-bytes>-1</max-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
            <auto-create-queues>true</auto-create-queues>
            <auto-create-addresses>true</auto-create-addresses>
            <auto-create-jms-queues>true</auto-create-jms-queues>
            <auto-create-jms-topics>true</auto-create-jms-topics>
            <auto-delete-queues>false</auto-delete-queues>   
            <auto-delete-addresses>false</auto-delete-addresses> 
            <auto-delete-jms-queues>false</auto-delete-jms-queues> 
            <auto-delete-jms-topics>false</auto-delete-jms-topics>
         </address-setting>
         <!--default for catch all-->
         <address-setting match="#">
            <dead-letter-address>DLQ</dead-letter-address>
            <expiry-address>ExpiryQueue</expiry-address>
            <redelivery-delay>1000</redelivery-delay>
            <max-delivery-attempts>3</max-delivery-attempts>
            <auto-create-dead-letter-resources>true</auto-create-dead-letter-resources>
            <!-- with -1 only the global-max-size is in use for limiting -->
            <max-size-bytes>-1</max-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
            <auto-create-queues>true</auto-create-queues>
            <auto-create-addresses>true</auto-create-addresses>
            <auto-create-jms-queues>true</auto-create-jms-queues>
            <auto-create-jms-topics>true</auto-create-jms-topics>
            <auto-delete-queues>false</auto-delete-queues>   
            <auto-delete-addresses>false</auto-delete-addresses> 
            <auto-delete-jms-queues>false</auto-delete-jms-queues> 
            <auto-delete-jms-topics>false</auto-delete-jms-topics>
         </address-setting>
         <address-setting match="exampleQueue">            
            <dead-letter-address>DLQ</dead-letter-address>                      
            <redelivery-delay>1000</redelivery-delay>    
            <max-delivery-attempts>3</max-delivery-attempts>
            <max-size-bytes>-1</max-size-bytes>
            <page-size-bytes>1048576</page-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
         </address-setting>
      </address-settings>
   </core>
</configuration>

Solution

  • What you're observing is called "split brain" and it is discussed at considerable length in the documentation. It appears that you haven't configured anything to mitigate split brain in your deployment. I recommend you review the documentation and update your deployment accordingly. If you have additional questions not addressed by the documentation please let us know.

    To be clear, it's impossible to say what the root cause of the latency problem is given the information provided. It may be a network issue, a hardware issue, etc. The latency limit referenced by the AMQ222207 message is controlled by the value of <call-timeout> set on the <cluster-connection> defined in broker.xml which is responsible for joining the primary and backup server. This value defaults to 30,000 milliseconds. If you feel that this is too low for your environment and/or use-case, you can increase it. Keep in mind, however, that the primary broker must receive a response from the backup before it responds to the producer, so high latency between the primary and backup will bubble up to the clients producing the messages.
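
As a rough illustration of the <call-timeout> setting described above, the fragment below adds it to the cluster-connection from the question's primary broker.xml. This is a sketch, not a recommended configuration: the value 60000 is an arbitrary placeholder chosen only to show a value above the default, and the remaining elements are copied from the question with the rest of the cluster-connection elided. It assumes the same change would be made in the backup's broker.xml (with its own connector-ref), and it does nothing to address split brain.

```xml
<cluster-connections>
   <cluster-connection name="artemis-cluster">
      <connector-ref>amq1</connector-ref>
      <!-- Illustrative value only: milliseconds to wait for a response
           on the cluster connection (the default is 30000) -->
      <call-timeout>60000</call-timeout>
      <retry-interval>1000</retry-interval>
      <!-- remaining elements unchanged from the question's configuration -->
      <discovery-group-ref discovery-group-name="artemis-discovery-group"/>
   </cluster-connection>
</cluster-connections>
```

A higher timeout only delays the point at which the primary disconnects a slow backup. Producers still wait on replication for that long, so it is worth finding the cause of the latency before raising the limit.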