Tags: java, akka, marathon

Automatically down nodes in Akka cluster with marathon-api after deployment


I have an application that deploys an Akka cluster using marathon-api with ClusterBootstrap.

When a deployment runs it does the following:

  • Adds a new instance with the new version of the application
  • Kills one of the old instances
  • Repeats this until the deployment is done

We have a cluster of 4 nodes.

After doing a deployment, the cluster looks like this (assuming 2 instances in this example):

{
  "leader": "akka.tcp://[email protected]:13655",
  "members": [
    {
      "node": "akka.tcp://[email protected]:15724",
      "nodeUid": "-1598489963",
      "roles": [
        "dc-default"
      ],
      "status": "Up"
    },
    {
      "node": "akka.tcp://[email protected]:13655",
      "nodeUid": "-1604243482",
      "roles": [
        "dc-default"
      ],
      "status": "Up"
    }
  ],
  "oldest": "akka.tcp://[email protected]:15724",
  "oldestPerRole": {
    "dc-default": "akka.tcp://[email protected]:15724"
  },
  "selfNode": "akka.tcp://[email protected]:13655",
  "unreachable": [
    {
      "node": "akka.tcp://[email protected]:15724",
      "observedBy": [
        "akka.tcp://[email protected]:13655"
      ]
    }
  ]
}

Sometimes the leader remains WeaklyUp, but the idea is the same; the nodes that are gone can show as either Up or Leaving.

Then the logs start showing this message:

Cluster Node [akka.tcp://[email protected]:13655] - Leader can currently not perform its duties, reachability status: [akka.tcp://[email protected]:13655 -> akka.tcp://[email protected]:15724: Unreachable [Unreachable] (1)], member status: [
akka.tcp://[email protected]:15724 Up seen=false, 
akka.tcp://[email protected]:13655 Up seen=true]

And deploying more times makes this even worse.

I imagine that killing a node alters the state of the cluster, since that node really is unreachable afterwards, but I was hoping there would be some kind of feature that solves this issue.

Until now, the only thing that works to solve this is to use Akka Cluster HTTP Management and do a PUT /cluster/members/{address} with operation: Down.
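
For reference, this is roughly the management configuration I rely on for that call (a sketch only; the bind address and the default port 8558 are assumptions from my setup, and as far as I understand route-providers-read-only has to be off or write operations such as Down are rejected):

akka.management.http {
  hostname = "0.0.0.0"   # bind address, adjust to your environment
  port = 8558            # default Akka Management port
  # Routes that modify the cluster (e.g. PUT /cluster/members/{address})
  # are only exposed when the read-only flag is disabled
  route-providers-read-only = false
}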

I know there was a feature called auto-downing, which was removed because it was doing more harm than good.

I also tried the Split Brain Resolver with the strategies it provides, but in the end those just end up downing the complete cluster, with a log like this:

> Cluster Node [akka://[email protected]:43211] - Leader can currently not perform its duties, reachability status: [akka://[email protected]:43211 -> akka://[email protected]:2174: Unreachable [Unreachable] (1)], member status: [akka://[email protected]:2174 Up seen=false, akka://[email protected]:43211 Up seen=true]
> Running CoordinatedShutdown with reason [ClusterDowningReason]
> SBR is downing
> SBR took decision DownReachable and is downing

Maybe I have not set up the right strategy with the right configuration, but I am not sure what to try. Again, I have a 4-node cluster, so I would guess the default Keep Majority strategy should do it, although this case is more a crashed node than a network partition.

Is there a way to have a smooth deployment of an Akka cluster using marathon-api? I am open to suggestions.

Update: I was also upgrading the Akka version from 2.5.x to 2.6.x, which the documentation states is not compatible, so I had to intervene manually in the first deployment. In the end, using the Split Brain Resolver with the default configuration did work.
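
For anyone hitting the same thing, this is roughly what enabling the resolver looks like (a sketch; since Akka 2.6.6 SBR ships with akka-cluster, and the values shown are just the documented defaults):

akka.cluster {
  downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
  split-brain-resolver {
    active-strategy = keep-majority   # default strategy
    stable-after = 20s                # how long to wait before acting on unreachability
  }
}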


Solution

  • You'll need to use a "real" downing provider like the Split Brain Resolver. This lets the cluster safely down nodes that are unreachable. (As opposed to auto-downing, which downs them without considering whether it is safe or not.)

    There's a separate question of why DC/OS is killing the nodes so quickly that they don't get the chance to shut down properly. I don't know DC/OS well enough to say why that might be. Regardless, a downing provider is essential in clustered environments, so you will want to get that in place anyway.

    Edited due to your comments about SBR:

    • First, I want to be clear: marathon-api is almost certainly irrelevant here. marathon-api is how nodes discover other nodes in DC/OS. The problems you are having are fundamental cluster problems, namely unreachable nodes. A cluster with unreachable nodes is going to act the same way regardless of where it is running and how the nodes are discovered.
    • Fundamentally, my best guess is that you are having problems getting clean shutdowns. If SBR is downing your entire cluster, it's because it is getting to a point where there are more unreachable nodes than reachable ones (see the config sketch just below).
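
    To illustrate what I mean by a clean shutdown: if the process gets a SIGTERM (rather than being SIGKILLed straight away), Akka's CoordinatedShutdown makes the node leave the cluster gracefully, so it is removed instead of lingering as Up-but-unreachable. A sketch of the relevant settings (these are the Akka 2.6 defaults, shown only to illustrate; whether the platform actually gives the task that grace time is a separate, Marathon-level knob):

      akka.coordinated-shutdown {
        # Run the shutdown sequence (including leaving the cluster) on JVM termination
        run-by-jvm-shutdown-hook = on
        # Terminate the actor system as the final phase
        terminate-actor-system = on
        # Per-phase timeout; the whole sequence must fit inside the kill grace period
        default-phase-timeout = 5s
      }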

    As an example, what might be happening:

    • You have 4 live nodes and want to upgrade.
    • DC/OS kills the first node. For some unknown reason you aren't getting a clean shutdown, so the node is marked as unreachable. (Because it wasn't a clean shutdown, the cluster doesn't know whether the node still exists but is unresponsive and/or behind a partition.) There are 3 live nodes and 1 unreachable node.
    • DC/OS starts the replacement node. Perhaps it takes a while for your application to boot. So you have 3 live nodes, 1 unreachable node, and 1 unready node.
    • DC/OS kills another node. So you have 2 live nodes, 2 unreachable nodes, and 1 unready node. At this point SBR can no longer guarantee that you haven't had a network partition, because you don't have a majority. In order to prevent corruption, it must stop the cluster.

    So, I would recommend the following:

    • I don't know the details of DC/OS well enough, but you probably need to slow down your rolling upgrades; in K8S I'd use something like MinReadySeconds. (See the sketch after this list.)
    • You may want to consider a fifth node. Odd numbers are better because they make a majority easier to determine.
    • If you continue to have problems, you'll need to provide more logs from the SBR decision.
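
    As a very rough illustration of the first point, and hedged since, as I said, I don't know DC/OS well: these Marathon app-definition fields are my best understanding of how to make the rollout wait for healthy replacements and give killed tasks time to shut down cleanly (check them against your Marathon version):

      {
        "taskKillGracePeriodSeconds": 30,
        "upgradeStrategy": {
          "minimumHealthCapacity": 1.0,
          "maximumOverCapacity": 0.25
        }
      }

    With minimumHealthCapacity at 1.0 and some over-capacity, Marathon should start new instances and wait for them to become healthy before killing old ones, and the grace period gives each killed task time to run its shutdown hooks.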

    SBR is the answer here. I realize that you aren't having real network partitions, but the fact that you are having unreachable nodes means that Akka Cluster is unable to tell whether there is a network partition or not, and that is the root cause of the problem.