Need to solve JGroups non-grouping behavior for a two-host group


I'm using JGroups to form a cluster of two machines by having them join the same-named group, but the grouping behavior is erratic. Is there a JGroups configuration change or a server configuration change I need to make to get this working?

When two hosts are in a JGroups group and one member leaves, the other member observes this in viewAccepted. But when the departed member returns, neither sees the other in viewAccepted, and they run in effectively independent groups (bad). This non-grouping behavior is intermittent. For example, this morning I restarted a host that I had disconnected at 5pm yesterday, and neither host saw the other in the group. However, once I stopped both of them and restarted, they found each other in the group.

The ultimate use of JGroups is to cluster two Tomcat hosts running the same application such that the first host listed in the cluster runs reports while the second waits for the first to die, then takes the first host's place and runs reports. When the first host is returned to service, it becomes the second in the cluster membership list, and waits for the other to die. Right now the application's behavior echoes that of the test program, below.
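
The active/standby rule described above can be sketched in plain Java, independent of JGroups. This is only an illustration: the class and method names (FailoverRole, isActive) are hypothetical and not part of the application. In the real code, the list would presumably come from view.getMembers() and the second argument from channel.getAddress().

```java
import java.util.List;

/** Hypothetical helper: decides whether this node should run reports. */
final class FailoverRole {
  private FailoverRole() {}

  /**
   * Returns true when self is the first entry in the membership list.
   * A null or empty list means no view has been installed yet, so nobody is active.
   */
  static <A> boolean isActive(List<A> members, A self) {
    return members != null && !members.isEmpty() && members.get(0).equals(self);
  }
}
```

Note that when the two hosts end up in separate single-member views, each host is first in its own list, so under this rule both would consider themselves active. That is why the non-grouping behavior matters for the report-running use case.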

I'm using jgroups-4.1.6.Final.jar out of the box, with no custom udp.xml file and no arguments to new JChannel(). My test app runs on two RHEL 7 hosts and tries to get them to cluster. The code is below:

import java.util.List;

import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.Message;
import org.jgroups.Receiver;
import org.jgroups.View;


public class ClusterTest implements Receiver {
  private String clusterName;
  private JChannel channel;


  public ClusterTest(String[] args) {
    if (args.length > 0) {
      clusterName = args[0];
    } else {
      clusterName = "TEST_CLUSTER";
    }
  }

  /**
   * Joins the JGroups cluster
   */
  private void joinCluster() {
    System.out.println("joinCluster(): joining cluster \"" + clusterName + "\"");
    try {
      channel = new JChannel();
      channel.setReceiver(this);
      channel.connect(clusterName);
    } catch (Exception e) {
      System.out.print("joinCluster failed: ");
      e.printStackTrace();
    }
  }

  @Override
  public void viewAccepted(View view) {
    System.out.println("viewAccepted(): view = \"" + view + "\" for name = \"" + channel.getName() + "\" / \"" + channel.clusterName() + "\"");
    List<Address> viewMembers = view.getMembers();
    int memNum = 0;
    for (Address member : viewMembers) {
      System.out.println("viewAccepted(): member #" + memNum++ + " = " + member);
    }
    // channel.getName() returns this member's logical name, not the cluster name
    String myName = channel.getName();
    System.out.println("viewAccepted(): my member name = \"" + myName + "\"");
    if (viewMembers.size() > 0) {
      String clusterActive = viewMembers.get(0).toString();
      System.out.println("viewAccepted(): " + clusterName + ", cluster active = \"" + clusterActive + "\"\n");
    }
  }

  @Override
  public void receive(Message msg) {
    System.out.println("receive called, message = \"" + msg + "\"");
  }

  /**
   * Starts the test. The optional first argument is the cluster name.
   */
  public static void main(String[] args) {
    ClusterTest tester = new ClusterTest(args);
    tester.joinCluster();
  }

}

I exit the test application by pressing Ctrl-C, so the exit is ungraceful.
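
Since Ctrl-C still lets the JVM run its shutdown hooks, one way to take the ungraceful exit out of the picture is to close the channel from a hook, so the node leaves the group explicitly instead of being removed by failure detection on the other host. The following is a minimal sketch, not part of the original test program. The class and method names (GracefulExit, closeHook, register) are hypothetical, and it assumes the channel can be passed as an AutoCloseable (to my knowledge JChannel implements Closeable in 4.x, but verify that against your version).

```java
/** Hypothetical helper: closes a resource (e.g. the channel) when the JVM shuts down. */
final class GracefulExit {
  private GracefulExit() {}

  /** Builds, but does not register, a thread that closes the resource and logs any failure. */
  static Thread closeHook(AutoCloseable resource) {
    return new Thread(() -> {
      try {
        resource.close();
      } catch (Exception e) {
        System.err.println("close on shutdown failed: " + e);
      }
    }, "graceful-exit");
  }

  /** Registers the hook so it runs on normal exit and on SIGINT (Ctrl-C). */
  static void register(AutoCloseable resource) {
    Runtime.getRuntime().addShutdownHook(closeHook(resource));
  }
}
```

In the test program this would be called as GracefulExit.register(channel) right after channel.connect(clusterName). It does not by itself explain or fix the rejoin problem; it only removes the ungraceful exit as a variable in the test.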

Our Linux admin has looked at the logs and has seen nothing out of the ordinary. The admin also said that there are no restrictions or configurations that would inhibit UDP traffic.

Things I've tried:

  • Increasing the send and receive buffer sizes for Multicast, as per this solution. I now get no warning messages on startup due to inadequate buffer size.
  • Updating JGroups version to latest.
  • Waiting overnight to try to start a stopped host.

Things I don't want to try:

  • Using TCP, since the cluster membership is fluid (it can be any two hosts) and I would rather not update a configuration file whenever the host changes.

I am open to alternate solutions in case I can't get JGroups to work reliably.

Thanks!


Solution

  • (Answering my own question) I went ahead and used TCP even though I said I didn't want to, because I was somewhat desperate and late on software delivery. With TCP I no longer saw the bad behavior I had seen with UDP, which included the hosts appearing to ignore each other indefinitely. I modified the vanilla TCP configuration file to give me the performance I needed.

    I'm not going to pursue a UDP solution since I've got other work to do, and the TCP solution seems to work well enough to put into production.