WebSphere MQ v7.1 Channels going down

The sender and receiver channels between two queue managers (WebSphere MQ v7.1 running on Redhat Linux) that I have configured is going down pretty frequently. Any idea why? How can I debug this? Thanks.

Solution

Channels are expected to go down. The idea is that they stay active as long as there is traffic and then time out. Assuming they've been configured to trigger, the presence of a message on the XMitQ causes the channel to start up again.

The reason for this is that a triggered channel will generally restart if interrupted by a network failure or other adverse event. However if a channel is configured to stay running 24x7 then the only way it stops is due to one of these adverse events and that increases the likelihood that human intervention will be required to restart the channel. On the other hand, a channel that times out can survive all sorts of nasty network events that occur while it is inactive. Allowing it to time out when not in use thus improves overall reliability of the channel.

So how do you cause a channel to trigger? Make sure the transmission queue contains the TRIGGER, TRIGTYPE, TRIGDATA and INITQ attributes. For example, to define a transmission queue to the JUPITER QMgr:

DEF QL(JUPITER) +
    USAGE(XMITQ) +
    TRIGGER +
    TRIGTYPE(FIRST) +
    TRIGDATA('MYQMGR.JUPITER') +
    INITQ(SYSTEM.CHANNEL.INITQ) +
    REPLACE

The only variable of the bunch is TRIGDATA which contains the name of the channel serving this XMitQ.

Of course, the channel initiator must be running but in modern versions of WMQ it starts by default (based on the value of the queue manager's SCHINIT attribute) so generally will in fact be running.

The channel that is in STOPPED state cannot be triggered. By default the STOP CHL command uses STATUS(STOPPED) so most of the time manually stopping a channel prevents triggering. If you want to stop a channel in such a way that it will restart (for example to test triggering) use the STOP CHL(CHLNAME) STATUS(INACTIVE) command. If the channel is already in STOPPED state, either issue the START CHL command to make it start immediately or use the STOP CHL(CHLNAME) STATUS(INACTIVE) to change the status from STOPPED to INACTIVE without starting it.

Once the channels are up, the DISCINT attribute of the channel determines how long it will run before timing out. The value is in seconds and defaults to 600 which is 10 minutes. The DISCINT, KAINT and HBINT combine to determine when the channel comes down. Note that the TCP spec calls for things using keepalive to disable them by default so if you want to use keepalive on your channels, you must enable it in the QMgr tuning as described here.

Please see Triggering Channels in the Infocenter for more on the configuration details. Take a look at SupportPac MD0C WebSphere MQ - Keeping Channels Up and Running if you want to know more about the internals and tuning. (The SupportPac is a bit dated but the principles of tuning mostly still apply. Where there are discrepancies, the Infocenter is the authoritative version.)

If you want to keep channels up continuously, set DISCINT(0) but remember that triggering remains the preferred option. Some shops need to minimize response times during the business day and so set DISCINT to a value that allows the channels to time out at night but generally keeps them running all day. If for some reason you have triggering set up right and the channels go down prior to DISCIINT you should be able to check in the error logs for the reason why. These reside in the QMgr's directory under errors. For example, on UNIX/Linux they are in /var/mqm/qmgrs/qmgrname/errors and on Windows the default location is C:\Program Files(x86)\WebSphere MQ\QMgrs\qmgrname\errors. Look for the files named AMQERR??.LOG where ?? = 01, 02, or 03. The logs rotate where 01 is current, 02 is next and so on. If you have a very busy QMgr you need to capture these as soon as the channel goes down or they could roll off.