Which way is better when recovering QMGR?

Normally, we have two ways to recover a QMGR. One is to back up and restore the QMGR data and logs, and the other is to create a backup QMGR. My question is which one is better for QMGR recovery? Or does each have its own usage scenarios? Please help answer this.

Thanks

Solution

Recovery from what exactly? With what recovery time objective or what recovery point objective? The optimum answer depends on these requirements and also on how your applications are written. The best way from an architectural perspective is to treat messaging like a transport. When you back up your network routers, you don't back up messages that happen to be in-flight on the routers, you just back up the router configurations. Same thing with a Queue Manager. If you back up the object definitions, authorizations, ini files, exits and exit parms you can recreate an empty version of the QMgr and resume where the old one left off.

Unfortunately, most applications are designed as if the messaging layer were a database rather than a transport. This means they wish to recover messages from a downed QMgr. That is what the Backup QMgr is used for. The way it works is that the primary QMgr uses linear log files. Over time the active log files advance and old files that are no longer required for restoration "roll off" the end of the log set. These files are then shipped to where the backup QMgr lives and applied. The backup QMgr will then have an exact copy of the messages that were in the queues when that log file was last active.
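As a rough sketch of the backup QMgr cycle described above, the snippet below only builds the command lines involved and prints them; it does not run anything. It assumes the `strmqm -r` (replay shipped log extents) and `strmqm -a` (activate the backup) options as documented for backup queue managers, and that the log extents have already been copied to the backup host by some other means. The queue manager name `QM1.BACKUP` and the function names are hypothetical.

```python
from typing import List


def replay_commands(backup_qmgr: str) -> List[List[str]]:
    """Run on the backup host each time newly shipped log extents arrive."""
    # strmqm -r replays the copied extents without starting the queue manager.
    return [["strmqm", "-r", backup_qmgr]]


def activate_commands(backup_qmgr: str) -> List[List[str]]:
    """Run once, at failover, after the final replay has completed."""
    return [
        ["strmqm", "-a", backup_qmgr],  # mark the backup as active
        ["strmqm", backup_qmgr],        # then start it like any other queue manager
    ]


if __name__ == "__main__":
    # "QM1.BACKUP" is a hypothetical queue manager name.
    for argv in replay_commands("QM1.BACKUP") + activate_commands("QM1.BACKUP"):
        print(" ".join(argv))
```

Keeping replay and activation separate reflects the fact that replay is repeated for every shipment, while activation happens only once, when the primary is declared lost.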

There will always be a lag between the messages on the primary queue manager and those on the backup queue manager. That lag is represented by the space consumed by the primary and secondary active log extents. If the primary and secondary log extents are kept small in size and number, the number of messages lost in the failover can be minimized. A recovery point of zero cannot be achieved this way; however, it is a LOT better than point-in-time backups.
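To put a number on that lag, here is a back-of-the-envelope estimate of how much log data can sit in the active extents and therefore not yet be on the backup. It assumes the extent count comes from the `LogPrimaryFiles` and `LogSecondaryFiles` settings and that `LogFilePages` is counted in 4 KB pages; the two configurations compared are made-up examples, and the result is an upper bound on log bytes, not a message count.

```python
PAGE_BYTES = 4096  # assumption: LogFilePages is counted in 4 KB pages


def max_log_window_bytes(primary_files: int, secondary_files: int,
                         log_file_pages: int) -> int:
    """Upper bound on log data held in active extents (not yet shipped)."""
    return (primary_files + secondary_files) * log_file_pages * PAGE_BYTES


if __name__ == "__main__":
    # Two hypothetical configurations: a small log set and a large one.
    for primary, secondary, pages in [(3, 2, 4096), (10, 10, 16384)]:
        mib = max_log_window_bytes(primary, secondary, pages) / (1024 * 1024)
        print(f"{primary} primary + {secondary} secondary x {pages} pages: "
              f"up to {mib:.0f} MiB of log data")
```

The smaller configuration bounds the exposure at a fraction of the larger one, which is the trade-off described above: smaller and fewer extents mean less data at risk, at the cost of more frequent shipping.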

Which leads us to the other backup methodology you mentioned. Point-in-time backups (i.e. backing up the QMgr's queue and log files) cannot work if the backup is taken while the QMgr is running. On a busy QMgr the logs and queues are constantly written to and must be in synch. But backing up these files while active pretty much guarantees that the backed up logs won't synch with the backed up queues. It is possible the QMgr will be restored with damaged queues or that the QMgr will not even start after restoration.

The only time this backup strategy works is if the QMgr is stopped, and even then it is best used as a recovery option after an upgrade rather than for an active system. For example, say you take a valid point-in-time backup Sunday morning at 1am. Then during the week somebody deletes a queue file out from under the QMgr and you need to restore it. Restoring the one file won't work because it will be out of synch with the log and show as a damaged object. You must restore the entire QMgr. What you get back is all of the messages that were on all the queues as of 1am last Sunday. Worse, if the QMgr participates in a cluster, restoring it to a prior point in time resets the sequence numbers on the cluster command messages so even though it looks like the QMgr is restored and healthy, the cluster may ignore it or any changes you make to it.

The one backup strategy that is most common but not mentioned in your post is to back up the QMgr configuration. This includes:

Object definitions
Authorizations
Exit directories
Exit parms
ini files

From these you would be able to recreate the queue manager configuration, and all of these backups can be done while the QMgr is running. Restoring from them produces an empty QMgr to which the applications can connect just as before. The main requirement is that the applications (or human processes) must reconcile any missing messages.
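A configuration backup along these lines could be scripted roughly as follows. This is a sketch under several assumptions: that the `dmpmqcfg` utility is available (it ships with MQ 7.1 and later; older versions need a different tool), that the data directory is `/var/mqm` as on a default UNIX install, and that the queue manager's directory name matches its name, which is not always the case. The snippet only builds the command lines and file list; the function names and `QM1` are hypothetical, and exit parameters kept outside the object definitions would need to be added by hand.

```python
from typing import Dict, List


def dump_commands(qmgr: str) -> Dict[str, List[str]]:
    """Map an output file name to the command whose stdout should fill it."""
    return {
        # Object definitions with all attributes, as replayable MQSC.
        f"{qmgr}.mqsc": ["dmpmqcfg", "-m", qmgr, "-a"],
        # Authorizations, written out as setmqaut commands.
        f"{qmgr}.setmqaut": ["dmpmqcfg", "-m", qmgr,
                             "-x", "authrec", "-o", "setmqaut"],
    }


def files_to_copy(qmgr_dir: str, data_root: str = "/var/mqm") -> List[str]:
    """ini files and the exits directory (assumed default UNIX layout)."""
    return [
        f"{data_root}/mqs.ini",
        f"{data_root}/qmgrs/{qmgr_dir}/qm.ini",
        f"{data_root}/exits",
    ]


if __name__ == "__main__":
    # "QM1" is a hypothetical queue manager name.
    for out_file, argv in dump_commands("QM1").items():
        print(" ".join(argv), ">", out_file)
    for path in files_to_copy("QM1"):
        print("copy", path)
```

Because none of this touches the queue or log files, it can be run on a schedule against a live QMgr.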

There is one disaster recovery approach that can achieve a zero recovery point, i.e. not lose any messages. It uses synchronous disk replication under the QMgr's files. Each update to a queue or log file is replicated in real time to the disaster recovery site so the DR QMgr has an exact copy of the primary QMgr. When the primary goes down you break the replication and fire up the DR QMgr. Assuming your DNS is configured to also fail over, all remote QMgrs and programs will use the DR QMgr as if it were the primary.

There are a couple of HA options as well. Using a hardware cluster such as PowerHA or Veritas Cluster Server can fail over a QMgr from one server to another provided the QMgr's files are hosted on highly available disk such as SAN. The Multi-Instance QMgr can perform a similar failover without hardware cluster software and is based on highly available NFS storage. These are both HA solutions rather than DR solutions because both QMgr instances see the same disk storage. They must therefore be close to the same distance (in network terms) from that disk storage, or else performance on the more distant QMgr will suffer from latency and throughput may not be acceptable.

Additional info is available in the Availability, recovery and restart topic of the Infocenter.