Search code examples
activemq-artemis

Spilt brain (Pinging the network) concept in primary and backup


Recently, I faced a latency issue where the backup server became active after the primary server was disconnected, resulting in both servers being active simultaneously (a split-brain issue).

To resolve this issue, I am implementing a network pinging concept. I have configured the primary broker to continuously ping the backup IP, and the backup broker to continuously ping the primary IP to avoid a split-brain scenario.

Could you please help me confirm if my configuration and understanding are correct?

Primary broker.xml:

<?xml version='1.0'?>
<configuration xmlns="urn:activemq"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:xi="http://www.w3.org/2001/XInclude"
               xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">

   <core xmlns="urn:activemq:core" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="urn:activemq:core ">

      <name>amq1</name>
      <persistence-enabled>true</persistence-enabled>
      <journal-type>ASYNCIO</journal-type>
      <paging-directory>data/paging</paging-directory>
      <bindings-directory>data/bindings</bindings-directory>
      <journal-directory>data/journal</journal-directory>
      <large-messages-directory>data/large-messages</large-messages-directory>
      <journal-datasync>true</journal-datasync>
      <journal-min-files>2</journal-min-files>
      <journal-pool-files>10</journal-pool-files>
      <journal-device-block-size>4096</journal-device-block-size>
      <journal-file-size>100M</journal-file-size>
      <journal-buffer-timeout>28000</journal-buffer-timeout>
      <journal-max-io>8192</journal-max-io>
      <disk-scan-period>5000</disk-scan-period>
      <max-disk-usage>100</max-disk-usage>
      <critical-analyzer>true</critical-analyzer>
      <critical-analyzer-timeout>1800000</critical-analyzer-timeout>
      <critical-analyzer-check-period>1800000</critical-analyzer-check-period>
      <critical-analyzer-policy>HALT</critical-analyzer-policy>
      <page-sync-timeout>1628000</page-sync-timeout>
      <global-max-size>10GB</global-max-size>

      <!-- Network Check Configuration -->
      <network-check-period>10000</network-check-period>
      <network-check-timeout>1000</network-check-timeout>
      <network-check-list>172.16.7.56</network-check-list>
      <network-check-ping-command>ping -n 1 -w %d %s</network-check-ping-command>
      <network-check-ping6-command>ping6 -c 1 %2$s</network-check-ping6-command>

      <connectors>
        <connector name="amq1">tcp://ppr-t24-qve-01:61616</connector>
        <connector name="amq2">tcp://ppr-t24-qve-02:61616</connector>
      </connectors>
      
      <acceptors>
         <acceptor name="artemis">tcp://0.0.0.0:61616?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;amqpMinLargeMessageSize=102400;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true</acceptor>
         <acceptor name="amqp">tcp://0.0.0.0:5672?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=AMQP,CORE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpMinLargeMessageSize=102400;amqpDuplicateDetection=true</acceptor>
      </acceptors>
      
      <broadcast-groups>
         <broadcast-group name="artemis-broadcast-group">
            <group-address>231.7.7.7</group-address>
            <group-port>9876</group-port>
            <broadcast-period>2000</broadcast-period>
            <connector-ref>amq1</connector-ref>
         </broadcast-group>
      </broadcast-groups>
      
      <discovery-groups>
         <discovery-group name="artemis-discovery-group">
            <group-address>231.7.7.7</group-address>
            <group-port>9876</group-port>
            <refresh-timeout>10000</refresh-timeout>
         </discovery-group>
      </discovery-groups>
      
      <cluster-user>admin</cluster-user>
      <cluster-password>admin</cluster-password>
      
      <cluster-connections>
         <cluster-connection name="artemis-cluster">
            <connector-ref>amq1</connector-ref>
            <retry-interval>1000</retry-interval>
            <retry-interval-multiplier>3</retry-interval-multiplier>
            <max-retry-interval>5000</max-retry-interval>
            <initial-connect-attempts>-1</initial-connect-attempts>
            <reconnect-attempts>-1</reconnect-attempts>
            <use-duplicate-detection>true</use-duplicate-detection>
            <message-load-balancing>STRICT</message-load-balancing>
            <max-hops>1</max-hops>
            <discovery-group-ref discovery-group-name="artemis-discovery-group"/>
         </cluster-connection>
      </cluster-connections>
      
      <ha-policy>
         <replication>
            <primary>
               <group-name>artemis-discovery-group</group-name>
               <check-for-active-server>true</check-for-active-server>
            </primary>
         </replication>
      </ha-policy>
      
      <security-settings>
         <security-setting match="#">
            <permission type="createNonDurableQueue" roles="amq"/>
            <permission type="deleteNonDurableQueue" roles="amq"/>
            <permission type="createDurableQueue" roles="amq"/>
            <permission type="deleteDurableQueue" roles="amq"/>
            <permission type="createAddress" roles="amq"/>
            <permission type="deleteAddress" roles="amq"/>
            <permission type="consume" roles="amq"/>
            <permission type="browse" roles="amq"/>
            <permission type="send" roles="amq"/>
            <permission type="manage" roles="amq"/>
         </security-setting>
      </security-settings>
      
      <addresses>
         <address name="exampleQueue">
            <anycast>
               <queue name="exampleQueue"/>
            </anycast>
         </address>
         <address name="DLQ">
         </address>
         <address name="ExpiryQueue">
            <anycast>
               <queue name="ExpiryQueue" />
            </anycast>
         </address>
      </addresses>
      
      <address-settings>
         <address-setting match="activemq.management#">
            <dead-letter-address>DLQ</dead-letter-address>
            <expiry-address>ExpiryQueue</expiry-address>
            <redelivery-delay>0</redelivery-delay>
            <max-size-bytes>-1</max-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
            <auto-create-queues>true</auto-create-queues>
            <auto-create-addresses>true</auto-create-addresses>
            <auto-create-jms-queues>true</auto-create-jms-queues>
            <auto-create-jms-topics>true</auto-create-jms-topics>
            <auto-delete-queues>false</auto-delete-queues>
            <auto-delete-addresses>false</auto-delete-addresses>
            <auto-delete-jms-queues>false</auto-delete-jms-queues>
            <auto-delete-jms-topics>false</auto-delete-jms-topics>
         </address-setting>
         <address-setting match="#">
            <dead-letter-address>DLQ</dead-letter-address>
            <expiry-address>ExpiryQueue</expiry-address>
            <redelivery-delay>0</redelivery-delay>
            <auto-create-dead-letter-resources>true</auto-create-dead-letter-resources>
            <max-size-bytes>-1</max-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
            <auto-create-queues>true</auto-create-queues>
            <auto-create-addresses>true</auto-create-addresses>
            <auto-create-jms-queues>true</auto-create-jms-queues>
            <auto-create-jms-topics>true</auto-create-jms-topics>
            <auto-delete-queues>false</auto-delete-queues>
            <auto-delete-addresses>false</auto-delete-addresses>
            <auto-delete-jms-queues>false</auto-delete-jms-queues>
            <auto-delete-jms-topics>false</auto-delete-jms-topics>
         </address-setting>
         <address-setting match="exampleQueue">
            <dead-letter-address>DLQ</dead-letter-address>
            <redelivery-delay>1000</redelivery-delay>
            <max-delivery-attempts>3</max-delivery-attempts>
            <max-size-bytes>-1</max-size-bytes>
            <page-size-bytes>1048576</page-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
         </address-setting>
      </address-settings>
   </core>
</configuration>

Backup broker.xml:

<?xml version='1.0'?>
<configuration xmlns="urn:activemq"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:xi="http://www.w3.org/2001/XInclude"
               xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">

   <core xmlns="urn:activemq:core" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="urn:activemq:core ">

      <name>amq2</name>
      <persistence-enabled>true</persistence-enabled>
      <journal-type>ASYNCIO</journal-type>
      <paging-directory>data/paging</paging-directory>
      <bindings-directory>data/bindings</bindings-directory>
      <journal-directory>data/journal</journal-directory>
      <large-messages-directory>data/large-messages</large-messages-directory>
      <journal-datasync>true</journal-datasync>
      <journal-min-files>2</journal-min-files>
      <journal-pool-files>10</journal-pool-files>
      <journal-device-block-size>4096</journal-device-block-size>
      <journal-file-size>100M</journal-file-size>
      <journal-buffer-timeout>28000</journal-buffer-timeout>
      <journal-max-io>8192</journal-max-io>
      <disk-scan-period>5000</disk-scan-period>
      <max-disk-usage>100</max-disk-usage>
      <critical-analyzer>true</critical-analyzer>
      <critical-analyzer-timeout>1800000</critical-analyzer-timeout>
      <critical-analyzer-check-period>1800000</critical-analyzer-check-period>
      <critical-analyzer-policy>HALT</critical-analyzer-policy>
      <page-sync-timeout>1628000</page-sync-timeout>
      <global-max-size>2GB</global-max-size>

      <!-- Network Check Configuration -->
      <network-check-period>10000</network-check-period>
      <network-check-timeout>1000</network-check-timeout>
      <network-check-list>172.16.7.55</network-check-list>
      <network-check-ping-command>ping -n 1 -w %d %s</network-check-ping-command>
      <network-check-ping6-command>ping6 -c 1 %2$s</network-check-ping6-command>

      <connectors>
        <connector name="amq1">tcp://ppr-t24-qve-01:61616</connector>
        <connector name="amq2">tcp://ppr-t24-qve-02:61616</connector>
      </connectors>
      
      <acceptors>
         <acceptor name="artemis">tcp://0.0.0.0:61616?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;amqpMinLargeMessageSize=102400;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true</acceptor>
         <acceptor name="amqp">tcp://0.0.0.0:5672?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=AMQP,CORE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpMinLargeMessageSize=102400;amqpDuplicateDetection=true</acceptor>
      </acceptors>
      
      <broadcast-groups>
         <broadcast-group name="artemis-broadcast-group">
            <group-address>231.7.7.7</group-address>
            <group-port>9876</group-port>
            <broadcast-period>2000</broadcast-period>
            <connector-ref>amq2</connector-ref>
         </broadcast-group>
      </broadcast-groups>
      
      <discovery-groups>
         <discovery-group name="artemis-discovery-group">
            <group-address>231.7.7.7</group-address>
            <group-port>9876</group-port>
            <refresh-timeout>10000</refresh-timeout>
         </discovery-group>
      </discovery-groups>
      
      <cluster-user>admin</cluster-user>
      <cluster-password>admin</cluster-password>
      
      <cluster-connections>
         <cluster-connection name="artemis-cluster">
            <connector-ref>amq2</connector-ref>
            <retry-interval>1000</retry-interval>
            <retry-interval-multiplier>3</retry-interval-multiplier>
            <max-retry-interval>5000</max-retry-interval>
            <initial-connect-attempts>-1</initial-connect-attempts>
            <reconnect-attempts>-1</reconnect-attempts>
            <use-duplicate-detection>true</use-duplicate-detection>
            <message-load-balancing>STRICT</message-load-balancing>
            <max-hops>1</max-hops>
            <discovery-group-ref discovery-group-name="artemis-discovery-group"/>
         </cluster-connection>
      </cluster-connections>
      
      <ha-policy>
        <replication>
          <backup>
            <group-name>artemis-discovery-group</group-name>
            <allow-failback>true</allow-failback>
            <failback-delay>5000</failback-delay>
          </backup>
        </replication>
      </ha-policy>
      
      <security-settings>
         <security-setting match="#">
            <permission type="createNonDurableQueue" roles="amq"/>
            <permission type="deleteNonDurableQueue" roles="amq"/>
            <permission type="createDurableQueue" roles="amq"/>
            <permission type="deleteDurableQueue" roles="amq"/>
            <permission type="createAddress" roles="amq"/>
            <permission type="deleteAddress" roles="amq"/>
            <permission type="consume" roles="amq"/>
            <permission type="browse" roles="amq"/>
            <permission type="send" roles="amq"/>
            <permission type="manage" roles="amq"/>
         </security-setting>
      </security-settings>
      
      <addresses>
         <address name="exampleQueue">
            <anycast>
               <queue name="exampleQueue"/>
            </anycast>
         </address>
         <address name="DLQ">
         </address>
         <address name="ExpiryQueue">
            <anycast>
               <queue name="ExpiryQueue" />
            </anycast>
         </address>
      </addresses>
      
      <address-settings>
         <address-setting match="activemq.management#">
            <dead-letter-address>DLQ</dead-letter-address>
            <expiry-address>ExpiryQueue</expiry-address>
            <redelivery-delay>0</redelivery-delay>
            <max-size-bytes>-1</max-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
            <auto-create-queues>true</auto-create-queues>
            <auto-create-addresses>true</auto-create-addresses>
            <auto-create-jms-queues>true</auto-create-jms-queues>
            <auto-create-jms-topics>true</auto-create-jms-topics>
            <auto-delete-queues>false</auto-delete-queues>
            <auto-delete-addresses>false</auto-delete-addresses>
            <auto-delete-jms-queues>false</auto-delete-jms-queues>
            <auto-delete-jms-topics>false</auto-delete-jms-topics>
         </address-setting>
         <address-setting match="#">
            <dead-letter-address>DLQ</dead-letter-address>
            <expiry-address>ExpiryQueue</expiry-address>
            <redelivery-delay>0</redelivery-delay>
            <auto-create-dead-letter-resources>true</auto-create-dead-letter-resources>
            <max-size-bytes>-1</max-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
            <auto-create-queues>true</auto-create-queues>
            <auto-create-addresses>true</auto-create-addresses>
            <auto-create-jms-queues>true</auto-create-jms-queues>
            <auto-create-jms-topics>true</auto-create-jms-topics>
            <auto-delete-queues>false</auto-delete-queues>
            <auto-delete-addresses>false</auto-delete-addresses>
            <auto-delete-jms-queues>false</auto-delete-jms-queues>
            <auto-delete-jms-topics>false</auto-delete-jms-topics>
         </address-setting>
         <address-setting match="exampleQueue">
            <dead-letter-address>DLQ</dead-letter-address>
            <redelivery-delay>1000</redelivery-delay>
            <max-delivery-attempts>3</max-delivery-attempts>
            <max-size-bytes>-1</max-size-bytes>
            <page-size-bytes>1048576</page-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
         </address-setting>
      </address-settings>
   </core>
</configuration>

Solution

  • This kind of configuration is exactly opposite of what you should do. As noted in the documentation:

    The server will stop itself if it can’t ping one or more of the addresses in the list.

    So, if the primary broker is pinging the machine hosting the backup broker and the machine hosting the backup broker goes down then the primary broker will stop meaning both brokers are now stopped. Likewise, if the backup broker is pinging the machine hosting the primary broker and the machine hosting the primary broker goes down then the backup broker won't activate. This essentially invalidates your HA configuration.

    I recommend you integrate with ZooKeeper in your environment or switch to a share store configuration.