jakarta-ee ejb ejb-3.0 java-ee-6 glassfish-3

Glassfish cluster + Remote instance fails to recover EJB timers

Our setup uses Glassfish version 3.1.2.2:

DAS and instance-1 run on the same machine, while instance-2 runs on another machine in the same network as a config node.
We have set up transaction logging in a shared directory as per the Glassfish High Availability Guide: http://docs.oracle.com/cd/E18930_01/html/821-2416/gjjpy.html#gaxim
We are using unicast configuration for cluster communication since we have a Network Load Balancer running in multicast mode in the network.
Our application (.ear containing multiple .war) has 2 persistent timers (since we need only one instance per timer at a time in the cluster).

When instance-1 (or instance-2) is shut down normally, the other instance picks up the timers from the shut-down instance as expected. When instance-2 crashes or goes offline abnormally, instance-1 recovers instance-2's timers (again, as expected). But when instance-1 crashes, instance-2 does not recover instance-1's timers.

As far as I can see from the logs, instance-2 receives a proper failover message for instance-1 and starts the recovery, but finishes it without recovering any transactions or timers for the failed instance.

Can anyone tell me what the problem might be? (Should I provide any more information?)

Solution

After 2 weeks or so of work, we have finally located the problem.

It seems that when an instance in a cluster goes down, the recovering instance checks whether the failed instance is still up by trying to connect to the "node-host":"admin-node-port" of that instance. If you are using the default node created on the DAS (as we were), its node-host is set to "localhost" (as was the case for instance-1).

So instance-2 was checking whether instance-1 was down by connecting to "localhost" instead of "instance-1-ip", as it should have. Since the connection to localhost succeeded, instance-1 was falsely marked as running and the recovery did not go ahead.
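
To make the false positive concrete, here is a minimal sketch of a TCP liveness probe of the kind described above. It is not Glassfish's actual code; the class name NodeLivenessCheck, the method isListening, and the default port are hypothetical and only illustrate the idea. Run on the instance-2 machine with the host "localhost", the probe tests the local machine rather than instance-1, so it presumably succeeds whenever something local is listening on that port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical illustration only: not taken from the Glassfish source.
class NodeLivenessCheck {

    /**
     * Returns true if a TCP connection to host:port can be opened
     * within timeoutMillis, false on refusal, timeout, or any other I/O error.
     */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed placeholders: pass the node-host and admin port of the
        // failed instance as arguments. The defaults are illustrative.
        String nodeHost = args.length > 0 ? args[0] : "localhost";
        int adminPort = args.length > 1 ? Integer.parseInt(args[1]) : 24848;

        // "localhost" always resolves to the machine running this check,
        // so from another machine it says nothing about the failed instance.
        boolean up = isListening(nodeHost, adminPort, 2000);
        System.out.println(nodeHost + ":" + adminPort
                + (up ? " accepts connections" : " is not reachable"));
    }
}
```

Running this from the instance-2 machine, once with "localhost" and once with the real address of the instance-1 machine, should show the difference: only the second call says anything about instance-1.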

We had to change the node-host of the instance-1 node in the domain config.xml to fix this, since the configuration of the default localhost- node cannot be changed through asadmin or the admin console.