Our setup is: GlassFish version 3.1.2.2 -
When instance-1 (or instance-2) is shut down normally, the other instance picks up the timers from the shut-down instance as expected. When instance-2 crashes or goes offline abnormally, instance-1 recovers its timers (again, as expected). But when instance-1 crashes, instance-2 does not recover instance-1's timers.
As far as I can see from the logs, instance-2 receives the proper failover message for instance-1 and starts the recovery, but finishes it without recovering any transactions or timers for the failed instance.
Can anyone tell me what the problem might be? (Should I provide any more information?)
After 2 weeks or so of work, we have finally located the problem.
It seems that when an instance in a cluster goes down, the recovering instance first checks whether the failed instance is still up by trying to connect to the "node-host":"admin-node-port" of that instance. If you are using the node that is created by default on the DAS (as we were), its node-host is set to "localhost" (which was the case for instance-1).
So instance-2 was checking whether instance-1 was down by connecting to "localhost" instead of "instance-1-ip", as it should have. Because "localhost" resolves to instance-2's own machine, the connection succeeded, instance-1 was falsely marked as running, and the recovery did not go ahead.
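To make the false positive concrete, here is a minimal sketch of that kind of liveness check. It is not GlassFish's actual code: the class name NodeLivenessSketch, the method isListening, and the default port 4848 are assumptions made only for illustration. It shows that a TCP probe against "localhost" tests the machine running the probe, not the remote node.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch, not taken from GlassFish: a plain TCP liveness probe.
class NodeLivenessSketch {

    /** Returns true if something accepts TCP connections on host:port within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host could not be resolved.
            return false;
        }
    }

    public static void main(String[] args) {
        // node-host as recorded for the default node; "localhost" always
        // resolves to the machine that runs this check.
        String nodeHost = args.length > 0 ? args[0] : "localhost";
        // Assumed admin port for illustration; pass the real one as the second argument.
        int adminPort = args.length > 1 ? Integer.parseInt(args[1]) : 4848;

        boolean up = isListening(nodeHost, adminPort, 2000);
        System.out.println(nodeHost + ":" + adminPort
                + (up ? " looks alive, recovery would be skipped"
                      : " looks down, recovery would proceed"));
    }
}
```

Run on the surviving machine with "localhost" as the host, the probe succeeds whenever any local process listens on that port, regardless of the state of the crashed instance. Run with the real address of the failed machine, it reports the node as down.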
To fix this, we had to change the node-host of the instance-1 node directly in the domain configuration file (domain.xml), because the configuration of the default localhost- node cannot be changed through asadmin or the admin console.