We use IBM MQ for integration between various microservices. The applications are critical and we aim for zero downtime. We have a cluster of three queue managers, each running on a different server (each in a separate AWS availability zone): QM1 on server1, QM2 on server2 and QM3 on server3.
We configure three ConnectionFactory instances like below (note the differences in the connection name lists):
// Prefers QM1 (server1 is first in the list)
var connectionFactory1 = new MQQueueConnectionFactory();
connectionFactory1.setConnectionNameList("server1,server2,server3");
connectionFactory1.setPort(1414);
....
// Prefers QM2 (server2 is first in the list)
var connectionFactory2 = new MQQueueConnectionFactory();
connectionFactory2.setConnectionNameList("server2,server3,server1");
connectionFactory2.setPort(1414);
....
// Prefers QM3 (server3 is first in the list)
var connectionFactory3 = new MQQueueConnectionFactory();
connectionFactory3.setConnectionNameList("server3,server1,server2");
connectionFactory3.setPort(1414);
....
The idea behind this setup is to use all three queue managers at the same time. The first connection factory will consume from QM1, the second from QM2, and so on.
From time to time the servers running the queue managers need to be patched and restarted. When this happens, the queue manager on that server obviously has to be shut down. While a queue manager is down, all traffic is redirected through the other two queue managers in the cluster, so the flow of messages never stops.
While server1 is down:
connectionFactory1 switches to server2,server3, so it consumes from QM2
connectionFactory2 switches to server2,server3, so it consumes from QM2
connectionFactory3 switches to server3,server2, so it consumes from QM3
After patching server1 we start QM1.
The issue we have is that the three connection factories stay switched as above and never reconnect to QM1. The only way we were able to restore the desired state was by restarting the application, which is not an acceptable solution.
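To picture the behaviour described above, here is a minimal toy model in plain Java. It uses no IBM MQ classes at all; the class name ConnectionNameListDemo and the method firstReachable are hypothetical names made up for this sketch, and it assumes only that a connection name list is tried in order and that an established connection is left alone until it breaks.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Toy model only: no IBM MQ classes are involved, all names here are hypothetical.
class ConnectionNameListDemo {

    // Returns the first host in the name list that is currently reachable,
    // mimicking a connection name list being tried in order.
    static Optional<String> firstReachable(List<String> nameList, Set<String> reachable) {
        return nameList.stream().filter(reachable::contains).findFirst();
    }

    public static void main(String[] args) {
        List<String> nameList1 = List.of("server1", "server2", "server3");
        Set<String> duringPatch = Set.of("server2", "server3");
        Set<String> afterPatch = Set.of("server1", "server2", "server3");

        // server1 goes down: the connection breaks, so the list is walked again.
        String current = firstReachable(nameList1, duringPatch).orElseThrow();
        System.out.println("during patch: " + current);

        // server1 comes back: the connection to server2 is still healthy,
        // so in this model nothing walks the list again.
        System.out.println("after patch, existing connection: " + current);

        // Only a connection made from scratch would start at the top of the list.
        String fresh = firstReachable(nameList1, afterPatch).orElseThrow();
        System.out.println("after patch, new connection: " + fresh);
    }
}
```

In this model the existing connection stays on server2 after the patch, which matches what we observe; only a connection created from scratch would return to server1.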
In our client code we implemented some resiliency patterns to detect when QM1 comes back up and then reset connectionFactory1 (a Spring CachingConnectionFactory wrapped around the MQQueueConnectionFactory), as well as stopping and restarting all listener containers that consume from QM1 as their preferred queue manager, but this had no effect. The only thing that worked was restarting the Spring application context, which is effectively the same as restarting the application. When you have many such applications, this is really not a good solution.
I noticed that MQQueueConnectionFactory has a method setClientReconnectOptions(int options) throws javax.jms.JMSException, but the comment on that method did not make it clear to me whether it can be used for what we want.
Thank you in advance for your inputs.
Reconnect options are for re-making the connection after a failure. They will not cause a connection to be re-made just because the set of connections is unbalanced.
For more about reconnectable clients, see Reconnectable clients in the IBM Docs.
This is not an easy problem to solve from inside the application because it does not know about the whole environment. That is why IBM MQ now has a feature called Uniform Clusters with Application Rebalancing which does EXACTLY this. When you start up the recycled queue manager, the cluster (which has a picture of the whole set of connected client applications) notices the imbalance and tells some of the applications to go elsewhere. It uses the client application's ability to reconnect as per the above options, but the instruction to move to another queue manager comes from the queue manager the client is currently connected to, rather than being decided by the client.
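As a rough illustration of the idea only, the sketch below models a cluster that can see how many application instances are connected to each queue manager and moves instances until the counts are even. It is a toy in plain Java with hypothetical names (RebalanceDemo, rebalance); it is not IBM MQ's actual balancing algorithm, which is not described here, and it uses no IBM MQ API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: NOT IBM MQ's algorithm, all names here are hypothetical.
class RebalanceDemo {

    // Moves one instance at a time from the busiest to the least busy
    // queue manager until the counts differ by at most one.
    static Map<String, Integer> rebalance(Map<String, Integer> counts) {
        Map<String, Integer> result = new LinkedHashMap<>(counts);
        while (true) {
            String busiest = null;
            String idlest = null;
            for (String qm : result.keySet()) {
                if (busiest == null || result.get(qm) > result.get(busiest)) busiest = qm;
                if (idlest == null || result.get(qm) < result.get(idlest)) idlest = qm;
            }
            // Nothing to do for an empty map or an already even spread.
            if (busiest == null || result.get(busiest) - result.get(idlest) <= 1) {
                return result;
            }
            result.merge(busiest, -1, Integer::sum); // "go elsewhere"
            result.merge(idlest, 1, Integer::sum);   // reconnect to the idle one
        }
    }

    public static void main(String[] args) {
        // State from the question after QM1 is restarted:
        // two consumers sit on QM2, one on QM3, none on QM1.
        Map<String, Integer> connected = new LinkedHashMap<>();
        connected.put("QM1", 0);
        connected.put("QM2", 2);
        connected.put("QM3", 1);
        System.out.println("before: " + connected);
        System.out.println("after:  " + rebalance(connected));
    }
}
```

The point of the sketch is where the decision is made: the function needs the counts for every queue manager, which is information a single client does not have.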
The Uniform Clusters feature and Application Balancing were added in IBM MQ V9.1.2 and enhanced in several of the subsequent CD releases. The first LTS release to provide it would therefore be IBM MQ V9.2.0.
For more about Uniform Clusters, see About uniform clusters in the IBM Docs.