cloud-foundry, hazelcast, java-11

Hazelcast instances not connecting to Master over TCP


I'm using Hazelcast (v5.2.1) in embedded mode within my Spring Boot application, with a TCP-IP join configuration. I'm experiencing an intermittent issue where new instances will not attempt to connect to the Master IP of an existing cluster. I've only seen this behavior when a new instance obtains the Master IP from a non-Master node.
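
For reference, the embedded setup follows roughly the shape sketched below. The cluster name, routes, and Spring wiring are placeholders standing in for my actual configuration, so treat this as an illustration rather than the exact code:

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class HazelcastClusterConfig {

        @Bean
        public HazelcastInstance hazelcastInstance() {
            Config config = new Config();
            config.setClusterName("cluster-name");

            // TCP-IP join: members are discovered via the configured addresses
            config.getNetworkConfig().getJoin().getTcpIpConfig()
                  .setEnabled(true)
                  .addMember("app-one.apps.internal")   // placeholder apps.internal routes
                  .addMember("app-two.apps.internal");

            return Hazelcast.newHazelcastInstance(config);
        }
    }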

With trace level logs on, I first see the following logs:

"thread":"main","level":"DEBUG","logger":"c.h.i.s.tcp.TcpServerConnectionManager","message":"[new_instance_ip]:5701 [cluster-name] [5.2.1] Connection to: [domain_1]:5701 streamId:-1 is not yet in progress"
"thread":"hz.peaceful_faraday.cached.thread-2","level":"TRACE","logger":"c.h.i.server.tcp.TcpServerConnector","message":"[new_instance_ip]:5701 [cluster-name] [5.2.1] Starting to connect to [domain_1]:5701"

I then see the following logs print continually until the instance crashes:

thread":"main","level":"DEBUG","logger":"c.h.internal.cluster.impl.TcpIpJoiner","message":"[new_instance_ip]:5701 [cluster-name] [5.2.1] Sending join request to [master_ip]:5701"
thread":"hz.magical_elgamal.generic-operation.thread-0","level":"DEBUG","logger":"c.h.i.cluster.impl.ClusterJoinManager","message":"[new_instance_ip]:5701 [cluster-name] [5.2.1] Handling master response [master_ip]:5701 from [non_master_ip]:5701"
thread":"hz.magical_elgamal.generic-operation.thread-0","level":"DEBUG","logger":"c.h.internal.cluster.ClusterService","message":"[new_instance_ip]:5701 [cluster-name] [5.2.1] Setting master address to [master_ip]:5701"
thread":"hz.magical_elgamal.generic-operation.thread-0","level":"DEBUG","logger":"c.h.i.cluster.impl.ClusterJoinManager","message":"[new_instance_ip]:5701 [cluster-name] [5.2.1] Handling master response [master_ip]:5701 from [non_master_ip]:5701"

When this behavior occurs, I never see the new instance log "Starting to connect to [master_ip]". I verified, via logs, that the Master has no record of the new instance that failed to connect. It appears the new instance never attempted to run the ConnectTask runnable to connect to the Master.

I'm hosting the application on Cloud Foundry, and I have a limited five-minute window for the application to report that it is connected to the cluster. If the new instance doesn't report healthy within that window, Cloud Foundry restarts it. Usually after a couple of restarts, one of the new instances connects to the Master successfully.

The cluster spans multiple applications in Cloud Foundry, so I've enabled the Zone Aware partition grouping feature for resilience; the cluster is generally 18+ instances. It's also worth mentioning that I'm providing more than one apps.internal route in the TCP-IP config's member list.
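
The grouping part of the configuration looks roughly like this (the multiple routes go into the TCP-IP member list shown earlier); ZONE_AWARE grouping also relies on zone metadata being available to each member, which I'm glossing over here:

    import com.hazelcast.config.Config;
    import com.hazelcast.config.PartitionGroupConfig.MemberGroupType;

    public class PartitionGroupingSketch {

        // Zone Aware partition grouping so partition backups are placed in a different zone
        static void enableZoneAwareGrouping(Config config) {
            config.getPartitionGroupConfig()
                  .setEnabled(true)
                  .setGroupType(MemberGroupType.ZONE_AWARE);
        }
    }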

I've made the following changes, none of which have had any impact on the behavior I'm seeing (applied roughly as in the sketch after this list):

  • Lowered the hazelcast.max.join.seconds to 30
  • Increased the thread pool size of hz:io executors to 64 threads
  • Increased the thread pool size of default executors to 64 threads
  • Increased hazelcast.io.thread.count to 32
  • Increased hazelcast.io.input.thread.count to 32
  • Increased hazelcast.io.output.thread.count to 32
  • Explicitly disabled the multicast config
  • Set port auto increment off and manually set the port
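
For reference, those tweaks map onto the configuration roughly as follows. The property and executor names are taken from the list above; the wiring itself is an assumption, not a recommendation:

    import com.hazelcast.config.Config;

    public class JoinTuningSketch {

        static void applyTuning(Config config) {
            // Join and networking properties from the list above
            config.setProperty("hazelcast.max.join.seconds", "30");
            config.setProperty("hazelcast.io.thread.count", "32");
            config.setProperty("hazelcast.io.input.thread.count", "32");
            config.setProperty("hazelcast.io.output.thread.count", "32");

            // Larger executor pools for the "hz:io" and "default" executors
            config.getExecutorConfig("hz:io").setPoolSize(64);
            config.getExecutorConfig("default").setPoolSize(64);

            // Multicast explicitly disabled; fixed port with auto-increment off
            config.getNetworkConfig().getJoin().getMulticastConfig().setEnabled(false);
            config.getNetworkConfig().setPort(5701).setPortAutoIncrement(false);
        }
    }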

With the changes above, I expected to see the ConnectTask execute and start a connection to the Master, or to see the join logic retry doing so.


Solution

  • After further review, I discovered that my issue was having multiple instances behind one apps.internal route when using the default TCP-IP join. I created a custom SPI that resolves all of the IPs behind a given route (see the sketch below).
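
I haven't posted my actual implementation, but the core idea can be sketched with Hazelcast's Discovery SPI: resolve every A record behind each route and hand each individual address to the cluster, instead of letting the default TCP-IP join settle on a single IP per route. The class, constructor parameters, and logging below are illustrative only; wiring it up also requires a DiscoveryStrategyFactory and a DiscoveryConfig entry in the join config, which I've omitted for brevity.

    import com.hazelcast.cluster.Address;
    import com.hazelcast.logging.ILogger;
    import com.hazelcast.spi.discovery.AbstractDiscoveryStrategy;
    import com.hazelcast.spi.discovery.DiscoveryNode;
    import com.hazelcast.spi.discovery.SimpleDiscoveryNode;

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Illustrative strategy: report every IP behind each apps.internal route
    // as a separate member candidate, rather than one IP per route.
    public class RouteDnsDiscoveryStrategy extends AbstractDiscoveryStrategy {

        private final List<String> routes;
        private final int port;

        public RouteDnsDiscoveryStrategy(ILogger logger, Map<String, Comparable> properties,
                                         List<String> routes, int port) {
            super(logger, properties);
            this.routes = routes;
            this.port = port;
        }

        @Override
        public Iterable<DiscoveryNode> discoverNodes() {
            List<DiscoveryNode> nodes = new ArrayList<>();
            for (String route : routes) {
                try {
                    // All A records for the route, not just the first one
                    for (InetAddress ip : InetAddress.getAllByName(route)) {
                        nodes.add(new SimpleDiscoveryNode(new Address(ip, port)));
                    }
                } catch (UnknownHostException e) {
                    getLogger().warning("Could not resolve route " + route, e);
                }
            }
            return nodes;
        }
    }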