We have three nodes in different AWS data centers. One of them is the sole seed node and the exclusive owner of a cluster singleton, which we arrange by calling .withDataCenter on the singleton proxy settings. We can get the cluster to work as designed by starting the seed node first and then the others, but if any node goes down, the only way we've found to get them talking again is to restart the whole cluster in the same order. We'd like the nodes to keep retrying the seed node on their own and resume normal operation once they can reach it again.
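For context, here is a simplified sketch of that setup, with placeholder actor names, hosts, and ports rather than our real ones (the seed node's data center is "asia", per the logs below):

import akka.actor.{Actor, ActorSystem, PoisonPill, Props}
import akka.cluster.singleton.{ClusterSingletonManager, ClusterSingletonManagerSettings, ClusterSingletonProxy, ClusterSingletonProxySettings}
import com.typesafe.config.ConfigFactory

class OurSingleton extends Actor {          // stand-in for the real singleton actor
  def receive = { case msg => sender() ! msg }
}

object Node extends App {
  // Per-node settings: self-data-center differs per node ("asia" on the seed,
  // "indonesia" etc. elsewhere); seed-nodes lists only the one seed node.
  val config = ConfigFactory.parseString("""
    akka.actor.provider = "cluster"
    akka.remote.netty.tcp.port = 2552
    akka.cluster.seed-nodes = ["akka.tcp://application@seed-host:2552"]
    akka.cluster.multi-data-center.self-data-center = "asia"
  """).withFallback(ConfigFactory.load())

  val system = ActorSystem("application", config)

  // Started only on the node in the seed's data center, so the singleton lives there.
  system.actorOf(
    ClusterSingletonManager.props(
      singletonProps = Props[OurSingleton],
      terminationMessage = PoisonPill,
      settings = ClusterSingletonManagerSettings(system)),
    name = "singletonManager")

  // Started on every node; routes messages to the singleton in the "asia" data center.
  val proxy = system.actorOf(
    ClusterSingletonProxy.props(
      singletonManagerPath = "/user/singletonManager",
      settings = ClusterSingletonProxySettings(system).withDataCenter("asia")),
    name = "singletonProxy")
}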
When I take down a non-seed node, the seed node marks it as UNREACHABLE and begins to periodically log the following:
Association with remote system [akka.tcp://application@xxx.xx.x.xxx:xxxx] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://application@xxx.xx.x.xxx:xxxx]] Caused by: [connection timed out: /xxx.xx.x.xxx:xxxx]
Fair enough. When I bring the node back up, however, the newly-started node begins to repeat:
2018-01-29 22:59:09,587 [DEBUG]: akka.cluster.ClusterCoreDaemon in application-akka.actor.default-dispatcher-18 -
now supervising Actor[akka://application/system/cluster/core/daemon/joinSeedNodeProcess-16#-1572745962]
2018-01-29 22:59:09,587 [DEBUG]: akka.cluster.JoinSeedNodeProcess in application-akka.actor.default-dispatcher-3 -
started (akka.cluster.JoinSeedNodeProcess@2ae57537)
2018-01-29 22:59:09,755 [DEBUG]: akka.cluster.JoinSeedNodeProcess in application-akka.actor.default-dispatcher-2 -
stopped
The seed node logs:
2018-01-29 22:56:25,442 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-4 -
Cluster Node [akka.tcp://application@xxx.xx.x.xxx:xxxx] dc [asia] - New incarnation of existing member [Member(address = akka.tcp://application@xxx.xx.x.xxx:xxxx, dataCenter = indonesia, status = Up)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
2018-01-29 22:56:25,443 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-18 -
Cluster Node [akka.tcp://application@xxx.xx.x.xxx:xxxx] dc [asia] - Marking unreachable node [akka.tcp://application@xxx.xx.x.xxx:xxxx] as [Down]
and repeatedly thereafter:
2018-01-29 22:57:41,659 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-18 -
Cluster Node [akka.tcp://application@xxx.xx.x.xxx:xxxx] dc [asia] - Sending InitJoinAck message from node [akka.tcp://application@xxx.xx.x.xxx:xxxx] to [Actor[akka.tcp://application@xxx.xx.x.xxx:xxxx/system/cluster/core/daemon/joinSeedNodeProcess-8#-1322646338]]
2018-01-29 22:57:41,827 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-18 -
Cluster Node [akka.tcp://application@xxx.xx.x.xxx:xxxx] dc [asia] - New incarnation of existing member [Member(address = akka.tcp://application@xxx.xx.x.xxx:xxxx, dataCenter = indonesia, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
It seems strange to me that the log says things "will" happen that never actually do: the existing member is not removed, and the new member is never allowed to join. I've been googling that message and can't find an explanation of what I need to do to make that actually happen.
Assuming you're on Akka.NET, it looks like you may have hit an open issue in which the leader keeps trying to remove the old incarnation so that the new incarnation can join. The issue ticket contains some troubleshooting suggestions about relaxing heartbeat-interval that may provide insight into possible causes.
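For reference, those settings live under akka.cluster.failure-detector and can be relaxed in your configuration. The values below are only illustrative starting points (the defaults are heartbeat-interval = 1 s and acceptable-heartbeat-pause = 3 s), not recommendations taken from the ticket:

akka.cluster.failure-detector {
  # Send heartbeats less frequently than the 1 s default.
  heartbeat-interval = 3 s
  # Tolerate longer gaps between heartbeats before marking a node unreachable.
  acceptable-heartbeat-pause = 10 s
}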
Given the generally higher latency between geographically dispersed data centers, one area I would look into closely is failure detection.
This may not seem directly relevant to the reported problem, but based on the logs you posted there appear to be time discrepancies between the two nodes in the different data centers.
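If it turns out you're on JVM Akka 2.5's multi-data-center support (which the dc [asia] labels and .withDataCenter would suggest), note that cross-data-center heartbeating uses a separate failure detector with its own settings. A sketch of the relevant keys, again with values meant only to illustrate where the tolerance for cross-DC latency is tuned:

akka.cluster.multi-data-center.failure-detector {
  # Heartbeats between data centers are sent less often than within one.
  heartbeat-interval = 3 s
  # Allow a longer pause before marking a node in another data center unreachable.
  acceptable-heartbeat-pause = 10 s
}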