java · linux · out-of-memory · hazelcast

Cluster gets into state where members restart repeatedly and clients cannot update the data in the cluster


We've been using Hazelcast for a number of years, but I'm new to the group. We have a cluster formed by a dedicated Java application (its sole purpose is to provide the cluster). It's using the 3.8.2 jars and running on JDK 1.8.0_192 on Linux (CentOS 7).

The cluster manages relatively static data (i.e. a few updates a day or week), although an update may involve changing a 2MB chunk of data. We're using the default sharding config with 271 shards across 6 cluster members. There are between 40 and 80 clients. Each client connection should be long-lived and stable.
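For context, the member application is essentially a bare Hazelcast instance. The sketch below is a simplification rather than our exact code; the 271 figure is just Hazelcast's default partition count made explicit, and the single synchronous backup is the Hazelcast default for maps.

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class ClusterMember {
    public static void main(String[] args) {
        Config config = new Config();
        // 271 is Hazelcast's default partition ("shard") count; set explicitly here only for clarity
        config.setProperty("hazelcast.partition.count", "271");
        // one synchronous backup per map entry (also the default) so losing a single member doesn't lose data
        config.getMapConfig("default").setBackupCount(1);
        // the process then just sits here providing the cluster; clients connect via the Hazelcast client protocol
        HazelcastInstance member = Hazelcast.newHazelcastInstance(config);
    }
}
```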

"Occasionally" we get into a situation where the Java app that's providing the cluster repeatedly restarts and any client that attempts to write to the cluster is unable to do so. We've had issues in the past where the cluster app runs out of memory due to limits on the JVM command line. We've previously increased these and (to the best of my knowledge) the process restarts are no longer caused by OutOfMemory exceptions.

I'm aware we're running a very old version and many people will suggest simply updating. This is work we will carry out but we're attempting to diagnose the existing issue with the system we have in front of us.

What I'm looking for here are suggestions for the kinds of investigation to carry out and queries to run, either periodically while the system is healthy or while it is in this failed state.

We regularly use tools such as netstat, tcpdump, Wireshark and top (I'm sure there are more) when diagnosing issues like this, but we have been unable to establish a convincing root cause.

Any help greatly appreciated.

Thanks, Dave

As per the problem description, our only way to resolve the issue is to bounce the cluster completely, i.e. stop all the members and then restart the cluster. Ideally we'd have a system that remained stable and could recover from whatever "event" causes the issue we're seeing. This may involve config or code changes.


Solution

  • Updating entries of around 2MB in size has many consequences: large serialization/deserialization costs, fat packets on the network, the cost of accommodating those chunks in the JVM heap, etc. An ideal entry size is under 30-40KB.

    For your immediate problem, start with GC diagnosis. You can use jstat to investigate memory usage patterns. If you are running into a lot of full GCs and/or back-to-back full GCs, then you will need to adjust your heap settings (see the GC-logging sketch at the end of this answer). Also check the network bandwidth, which is usually the prime suspect when fat packets are traveling through the network.

    All of the above are just band-aid solutions; you should really look to break your entries down into smaller entries, as sketched below.
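
    To make that concrete, here is one possible sketch (illustrative only, not a built-in Hazelcast feature) of splitting a large value into roughly 32KB entries under derived keys. The chunk size, key scheme and class name are assumptions you would adapt to your own data model.

    ```java
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class ChunkedWriter {
        // ~32KB per entry keeps each value under the suggested 30-40KB ceiling
        private static final int CHUNK_SIZE = 32 * 1024;

        /** Splits one large byte[] value into chunk entries and writes them with a single putAll. */
        public static void putChunked(HazelcastInstance hz, String mapName, String key, byte[] value) {
            IMap<String, byte[]> map = hz.getMap(mapName);
            int chunkCount = (value.length + CHUNK_SIZE - 1) / CHUNK_SIZE;

            Map<String, byte[]> batch = new HashMap<>();
            for (int i = 0; i < chunkCount; i++) {
                int from = i * CHUNK_SIZE;
                int to = Math.min(from + CHUNK_SIZE, value.length);
                batch.put(key + "#" + i, Arrays.copyOfRange(value, from, to));
            }
            // small metadata entry so readers know how many chunks to fetch back
            batch.put(key + "#count", ByteBuffer.allocate(4).putInt(chunkCount).array());
            map.putAll(batch);
        }
    }
    ```

    Because the derived keys hash to different partitions, the chunks (and their serialization cost) are spread across the members instead of landing on one, and readers can pull them back with a getAll on the chunk keys.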
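
    On the GC side, alongside jstat you can also log collection counts and times from inside the member itself using the standard JMX beans. This is nothing Hazelcast-specific, and the one-minute interval below is just an example, but it means back-to-back full GCs show up in the member's own logs around the time of a restart.

    ```java
    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class GcLogger {
        /** Logs cumulative GC counts/times every minute; a fast-growing old-generation collector count suggests back-to-back full GCs. */
        public static void start() {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    System.out.printf("GC %s: count=%d, time=%dms%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
            }, 0, 1, TimeUnit.MINUTES);
        }
    }
    ```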