Tags: scala, docker, amazon-ec2, akka-cluster

Akka cluster keeps storing wrong membership information on EC2/Docker


Environment: scala-2.11.x, akka-2.5.9

There are two hosts on EC2: H1 and H2. The sbt project has three modules: master, client and worker. Each module implements an akka-cluster node that subscribes to cluster events and logs them. Each node also logs the cluster state every minute (for debugging). The following ports are used for the cluster nodes: master: 2551, worker: 3000, client: 5000.
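
Each module boils down to the same pattern. A minimal sketch of such a node (class and actor names here are illustrative; the real code is in the GitHub project linked below):

import akka.actor.{Actor, ActorLogging, ActorSystem, Props}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._
import scala.concurrent.duration._

// Subscribes to cluster membership events and dumps the full membership
// every minute, which is where the "cluster status" logs below come from.
class ClusterWatcher extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)
  import context.dispatcher

  override def preStart(): Unit = {
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent], classOf[UnreachableMember])
    context.system.scheduler.schedule(1.minute, 1.minute, self, "log-state")
  }

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    case MemberUp(member)            => log.info("Member is Up: {}", member.address)
    case UnreachableMember(member)   => log.info("Member detected as unreachable: {}", member)
    case MemberRemoved(member, prev) => log.info("Member removed: {} (was {})", member.address, prev)
    case _: MemberEvent              => // ignore other membership transitions
    case "log-state"                 => log.info("cluster status: {}", cluster.state.members)
  }
}

object WorkerNode extends App {
  // akka.remote.netty.tcp.port is expected to be set per module
  // (2551 / 3000 / 5000) in the configuration.
  val system = ActorSystem("ClusterSystem")
  system.actorOf(Props[ClusterWatcher], "cluster-watcher")
}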

The project is available on GitHub.

More details about the infrastructure can be found in my previous question.

A module can be redeployed to either H1 or H2 at random.

The Akka cluster shows strange behavior when one of the nodes (for example worker) is redeployed. The following steps illustrate the deployment history:

The initial state: worker is deployed on H1, while master and client are deployed on H2.

----[state-of-deploying-0]---  
H1 = [worker]
H2 = [master, client]

cluster status:    // cluster works correctly
  Member(address = akka.tcp://ClusterSystem@H1:3000, status = Up)
  Member(address = akka.tcp://ClusterSystem@H2:2551, status = Up)
  Member(address = akka.tcp://ClusterSystem@H2:5000, status = Up)
----------------

After that, the worker module is redeployed to host H2:

----[state-of-deploying-1]---  
H1 = [-]
H2 = [master, client, worker (Redeployed)]

cluster status:    // WRONG cluster state!
  Member(address = akka.tcp://ClusterSystem@H1:3000, status = Up) // ???
  Member(address = akka.tcp://ClusterSystem@H2:2551, status = Up)
  Member(address = akka.tcp://ClusterSystem@H2:3000, status = WeaklyUp)
  Member(address = akka.tcp://ClusterSystem@H2:5000, status = Up)
----------------

The above situation happens occasionally. In this case the cluster stores a wrong membership state and never repairs it:

Member(address = akka.tcp://ClusterSystem@H1:3000, status = Up) // ???

Host H1 no longer contains any instance of worker, and > telnet H1 3000 returns connection refused. But why does the Akka cluster keep storing this stale information?


Solution

  • This behaviour is intended: an Akka cluster in production should be run without automatic downing, in order to prevent split-brain problems.

    Imagine a two-node cluster (A and B) with two clients (X and Y):

    • At the beginning everything is fine: a request from Y, which is connected to B, can be forwarded over remoting to Actor1 running on A for processing.
    • Then, because of a network partition, A becomes unreachable from B; with automatic downing, B would mark A as down and restart Actor1 locally.
    • Client Y now sends messages to the Actor1 running on B.
    • Client X, on the other side of the partition, is still connected to A and keeps sending messages to the Actor1 running there.

    This is the split-brain problem: the same actor, with the same identifier, is running on two nodes with two different states. It is very hard or impossible to rebuild the correct actor state afterwards.

    To prevent this from happening, you have to pick a downing strategy that fits your use case:

    • If you can afford it, use manual downing: an operator confirms that the unreachable nodes are really gone and marks them as down (see the sketch after this list).
    • If your cluster has a dynamic number of nodes, you need something more sophisticated, such as the Lightbend Split Brain Resolver.
    • If your cluster is static, you can use a quorum strategy to avoid split brain; in that case you always need to run an odd number of nodes.
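
    For manual downing, the stale incarnation has to be removed explicitly, either through JMX/HTTP management or programmatically via the Cluster extension. A minimal sketch, assuming the operator already knows the stale address from the question (akka.tcp://ClusterSystem@H1:3000):

    import akka.actor.{ActorSystem, Address}
    import akka.cluster.Cluster

    object DownStaleWorker extends App {
      // Run from (or connect to) a node that is still part of the cluster.
      val system  = ActorSystem("ClusterSystem")
      val cluster = Cluster(system)

      // Old worker incarnation that no longer exists on H1 (from the question).
      val stale = Address("akka.tcp", "ClusterSystem", "H1", 3000)

      // Marking the member as Down removes it from the membership once the
      // decision has been gossiped. Until this happens (manually or through a
      // downing provider), the unreachable member stays in the cluster state.
      cluster.down(stale)
    }

    Keep akka.cluster.auto-down-unreachable-after at its default (off) in production; automatic downing on a timeout is exactly what leads to the split-brain scenario described above.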