kubernetes, activemq-artemis, high-availability, jgroups, artemiscloud

Artemis live-only cluster scale-down fails to move messages


We are moving Artemis from a VM to K8s. As a testament to the resilience of the broker itself, the VM-based broker rarely, if ever, abends; it typically runs continuously for our full 3-month release cycle.

In K8s, we have a StatefulSet of 3 clustered live-only Artemis broker pods, each with its own persistent volume. The PV covers the (highly unlikely) case of a broker crash: persisted messages are delivered when the broker restarts, so no shutdown delay is incurred, only a startup delay. In K8s, rolling restarts (graceful broker shutdowns) are likely, due to K8s moving pods around for resource reallocation, monthly K8s patches or pod base-image OS upgrades, and changes to broker config settings. Pod scale-down is also a graceful broker shutdown. The Artemis broker sees no difference between a K8s graceful shutdown triggered by a rolling restart and one triggered by pod scale-down.
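
For context, here is a minimal sketch of the relevant broker.xml clustering pieces, assuming JGroups-based discovery; names such as bg-group1, dg-group1, netty-connector, and jgroups.xml are placeholders rather than our exact config:

    <!-- broker.xml (sketch): cluster connection discovered over a JGroups channel -->
    <broadcast-groups>
       <broadcast-group name="bg-group1">
          <jgroups-file>jgroups.xml</jgroups-file>
          <jgroups-channel>activemq_broadcast_channel</jgroups-channel>
          <connector-ref>netty-connector</connector-ref>
       </broadcast-group>
    </broadcast-groups>

    <discovery-groups>
       <discovery-group name="dg-group1">
          <jgroups-file>jgroups.xml</jgroups-file>
          <jgroups-channel>activemq_broadcast_channel</jgroups-channel>
          <refresh-timeout>10000</refresh-timeout>
       </discovery-group>
    </discovery-groups>

    <cluster-connections>
       <cluster-connection name="my-cluster">
          <connector-ref>netty-connector</connector-ref>
          <message-load-balancing>ON_DEMAND</message-load-balancing>
          <discovery-group-ref discovery-group-name="dg-group1"/>
       </cluster-connection>
    </cluster-connections>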

We are looking for the least message delivery delay by enabling Artemis scale-down, which avoids the pod-restart delay by redistributing the gracefully shutting-down broker's messages to another live broker. A graceful Artemis pod restart (shutdown + startup) can take over a minute to reach "AMQ221007: Server is now live".
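
For reference, scale-down on a live-only broker is enabled via the ha-policy in broker.xml; a minimal sketch (by default it reuses the broker's existing cluster/discovery configuration, which is where the bug below comes in):

    <ha-policy>
       <live-only>
          <scale-down>
             <enabled>true</enabled>
          </scale-down>
       </live-only>
    </ha-policy>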

However, there's an open Artemis scale-down bug we are encountering: "Scale-down fails when using same discovery-group used by Broker cluster connection". It describes a possible workaround:

JGroups channel used by scale-down is probably the same used by broker, but already being closed during broker shutdown itself.

As a workaround, it is possible to create a separate discovery-group (with its own broadcast-group) so that scale-down uses a new JGroups channel not being closed by broker. However, this causes duplication of configurations and a new JGroups port for the scale-down discovery must be opened.
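
To make the question concrete, here is roughly the shape of the workaround as I understand it (and roughly what my unsuccessful attempt looked like). The names scale-down-bg, scale-down-dg, scale_down_channel, and jgroups-scaledown.xml are placeholders; the second JGroups file implies a second channel and therefore an extra port to open:

    <!-- broker.xml (sketch): the groups used by the cluster connection stay as-is;
         these additional entries go inside the existing <broadcast-groups> and
         <discovery-groups> sections and are dedicated to scale-down only. -->
    <broadcast-group name="scale-down-bg">
       <jgroups-file>jgroups-scaledown.xml</jgroups-file>
       <jgroups-channel>scale_down_channel</jgroups-channel>
       <connector-ref>netty-connector</connector-ref>
    </broadcast-group>

    <discovery-group name="scale-down-dg">
       <jgroups-file>jgroups-scaledown.xml</jgroups-file>
       <jgroups-channel>scale_down_channel</jgroups-channel>
       <refresh-timeout>10000</refresh-timeout>
    </discovery-group>

    <ha-policy>
       <live-only>
          <scale-down>
             <enabled>true</enabled>
             <discovery-group-ref discovery-group-name="scale-down-dg"/>
          </scale-down>
       </live-only>
    </ha-policy>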

In that ticket, I asked whether example workaround files exist, as my attempt (included in a comment) was unsuccessful. I'm asking here for more visibility and a broader audience than a bug comment reaches. Is the workaround viable?

Additional info

  • Artemis 2.23.0
  • Client connection parameters: reconnectAttempts=15 and initialConnectAttempts=15, enough to cover the broker shutdown. We have not yet quantified the additional time needed to redistribute messages, but we expect it to be small
  • jgroups-kubernetes is used for discovery, with useNotReadyAddresses="false"; see the stack sketch after this list
  • When we started this migration effort, ArtemisCloud.io was not out of beta. Now taking another look at it, it appears they don't use Artemis scale-down, instead using their own scale-down controller.
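
For completeness, a sketch of the kind of JGroups stack the brokers point at, assuming jgroups-kubernetes' KUBE_PING for pod discovery; the bind port, namespace, and labels are illustrative placeholders, not our exact values:

    <!-- jgroups.xml (sketch): TCP stack using KUBE_PING to discover broker pods -->
    <config xmlns="urn:org:jgroups"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
       <TCP bind_port="7800"/>
       <org.jgroups.protocols.kubernetes.KUBE_PING
            namespace="our-namespace"
            labels="app=artemis"
            useNotReadyAddresses="false"/>
       <MERGE3/>
       <FD_SOCK/>
       <VERIFY_SUSPECT/>
       <pbcast.NAKACK2 use_mcast_xmit="false"/>
       <UNICAST3/>
       <pbcast.STABLE/>
       <pbcast.GMS/>
       <FRAG2/>
    </config>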

Solution

  • We were unable to resolve the issues, so we pivoted to ArtemisCloud, which we successfully deployed to production. It came with its own set of issues, mainly due to fitting it into our existing infrastructure:

    • Vault secrets retrieval on broker pod startup
    • Logs scraped to ELK via filebeat and stunnel