Tags: kubernetes, jms, activemq-artemis

How to use discovery groups in ActiveMQ Artemis client connection?


I need some help with the correct settings for a client connection to an ActiveMQ Artemis broker cluster.

My goal is to run a performance test against our Artemis cluster. During the test I would like to kill some nodes and expect the performance test to keep running.

First the cluster setup:

We are about to migrate to ArtemisCloud and have deployed the Artemis operator in our Kubernetes cluster. It is currently configured to run three nodes.

broker.xml:

<connectors>
  <connector name="artemis">tcp://msc-test-ss-0.msc-test-hdls-svc.it.svc.cluster.local:61616</connector>
</connectors>

<acceptors>
  <acceptor name="msc-test">
    tcp://msc-test-ss-0.msc-test-hdls-svc.it.svc.cluster.local:61616?protocols=AMQP,CORE,HORNETQ,MQTT,OPENWIRE,STOMP;tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;useEpoll=true;amqpCredits=1000;amqpMinCredits=300
  </acceptor>
</acceptors>

<cluster-user>SOME_USER</cluster-user>

<cluster-password>SOME_PASSWORD</cluster-password>

<broadcast-groups>
  <broadcast-group name="my-broadcast-group">
    <jgroups-file>jgroups-ping.xml</jgroups-file>
    <jgroups-channel>activemq_broadcast_channel</jgroups-channel>
    <connector-ref>artemis</connector-ref>
  </broadcast-group>
</broadcast-groups>

<discovery-groups>
  <discovery-group name="my-discovery-group">
    <jgroups-file>jgroups-ping.xml</jgroups-file>
    <jgroups-channel>activemq_broadcast_channel</jgroups-channel>
    <refresh-timeout>10000</refresh-timeout>
  </discovery-group>
</discovery-groups>

<cluster-connections>
  <cluster-connection name="my-cluster">
    <connector-ref>artemis</connector-ref>
    <retry-interval>1000</retry-interval>
    <retry-interval-multiplier>2</retry-interval-multiplier>
    <max-retry-interval>32000</max-retry-interval>
    <initial-connect-attempts>20</initial-connect-attempts>
    <reconnect-attempts>10</reconnect-attempts>
    <use-duplicate-detection>true</use-duplicate-detection>
    <message-load-balancing>ON_DEMAND</message-load-balancing>
    <max-hops>1</max-hops>
    <discovery-group-ref discovery-group-name="my-discovery-group"/>
  </cluster-connection>
</cluster-connections>

jgroups-ping.xml:

<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns="urn:org:jgroups"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd"
        >
    <TCP bind_addr="${jgroups.bind_addr:site_local}"
         bind_port="${jgroups.bind_port:7800}"
         external_addr="${jgroups.external_addr}"
         external_port="${jgroups.external_port}"
         thread_pool.min_threads="0"
         thread_pool.max_threads="200"
         thread_pool.keep_alive_time="30000"/>
    <RED/>

    <dns.DNS_PING
            dns_query="msc-test-ping-svc"
            dns_record_type="${DNS_RECORD_TYPE:A}" />

    <MERGE3  min_interval="10000"
             max_interval="30000"/>
    <FD_SOCK2/>
    <FD_ALL3 timeout="40000" interval="5000" />
    <VERIFY_SUSPECT2 timeout="1500"  />
    <pbcast.NAKACK2 use_mcast_xmit="false" />
    <UNICAST3 />
    <pbcast.STABLE desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="2000"/>
    <UFC max_credits="2M"
         min_threshold="0.4"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
</config>

Now the command I am testing with:

/var/lib/artemis-instance/bin/artemis perf client --user user --password password --show-latency --url="tcp://msc-test-hdls-svc:61616?reconnectAttempts=5&confirmationWindowSize=20000" --consumer-url="tcp://msc-test-hdls-svc:61616?reconnectAttempts=5" queue://performance #--hdr case-1-new.hdr

Running the command outputs some performance stats and it seems that the client has successfully connected to a broker:

Connection brokerURL = tcp://msc-test-hdls-svc:61616?reconnectAttempts=5&confirmationWindowSize=20000

--- warmup false 
--- sent:           281 msg/sec 
--- blocked:        281 msg/sec 
--- completed:      281 msg/sec 
--- received:       282 msg/sec 
--- send ack time:   mean:   3325.25 us - 50.00%:    679.00 us - 90.00%:   1687.00 us - 99.00%:  81407.00 us - 99.90%:  83455.00 us - 99.99%:  83455.00 us - max:     83455.00 us 
--- transfer time:   mean:   3402.32 us - 50.00%:    787.00 us - 90.00%:   4015.00 us - 99.00%:  78335.00 us - 99.90%:  83455.00 us - 99.99%:  83455.00 us - max:     83455.00 us

Now when killing a node, I receive the following warning:

2023-06-14 12:56:47,361 WARN  [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to msc-test-hdls-svc/10.4.1.113:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]

After that, stats keep being printed, but they all show 0 msg/sec.

Then after some more time:

2023-06-14 12:57:52,312 WARN  [org.apache.activemq.artemis.core.client] AMQ212036: Can not find packet to clear: 3 last received command id first stored command id 0

And the pod running the command is restarted.

My expectation was that the client would connect to a different node after detecting that one node had been killed.

Either my expectation is wrong or something in my setup isn't working. Any help would be appreciated.

EDIT: I also tried listing multiple broker IPs in the connection string, but the result was the same.


Solution

  • The issue here is that you don't have high availability configured, which means that when a broker fails the client will not fail over to a backup.

    In 2.29.0 this behavior will change, as fail-over to another live broker will be supported via ARTEMIS-4251.
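    For reference, a minimal sketch of what enabling HA might look like in broker.xml, assuming a replication-based live/backup pair (the element names are from the standard Artemis schema, but the right policy for you depends on your storage setup, and the ArtemisCloud operator exposes this through its own CRD fields rather than raw broker.xml):

    ```xml
    <!-- Hypothetical sketch: replication-based HA on a live broker.
         A paired backup broker would declare <slave/> instead of <master/>. -->
    <ha-policy>
      <replication>
        <master>
          <!-- Check the cluster for an already-running live server before
               activating, to reduce the risk of split brain after a restart. -->
          <check-for-live-server>true</check-for-live-server>
        </master>
      </replication>
    </ha-policy>
    ```

    On the client side, adding `ha=true` to the connection URL (typically together with a generous `reconnectAttempts` value) tells the Core client to fail over to the backup announced by the live broker when the connection drops.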