Tags: kubernetes, akka, akka-cluster, rolling-updates

Akka cluster handover during rolling updates in Kubernetes


I have been trying to deploy an Akka cluster in Kubernetes using the recommended RollingUpdate strategy plus the cluster app-version setting, aiming for a smooth deployment with minimal downtime. However, the handover process is causing a latency increase and subsequent downtime for the duration of the deployment.

The current rolling update config I am using is maxSurge=1 and maxUnavailable=0.

I would like to understand how the handover works and what the role of app-version is in this process, as I could not find any documentation on it.

Are messages still sent to the older shards while a new version is being rolled out? Or do all messages flow to the new version, causing a temporary bottleneck? Is there any way this can be improved to guarantee higher availability? Any thoughts or ideas will be appreciated.


Solution

  • As part of a rolling redeploy with zero maxUnavailable and a positive maxSurge, nodes with the new app version will be started and join the cluster. As they join, the shard coordinator (a cluster singleton, running on the oldest node in the cluster) will notice that they are responsible for no shards and will allocate shards to them (the precise algorithm is configurable as to how aggressively it rebalances and which shards are picked). While the selected shards are being moved, messages to those shards will be buffered. Note that there will be an orderly shutdown of the entities hosted by each moving shard: if the actors incarnating those entities are slow to stop (e.g. they have "tidying up" work to do), the shard shutdown process will, after some delay (generally tens of seconds, though this is tunable), stop them in a less orderly fashion. Once this is done, the shard will start up on the new node and the other nodes will flush their buffered messages.
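    The knobs governing this part of the process live under akka.cluster.sharding. Here is a minimal Scala sketch of the relevant settings; the keys come from the stock Akka reference.conf, but the values shown are illustrative placeholders, not recommendations:

    ```scala
    import com.typesafe.config.ConfigFactory

    object ShardingHandoverTuning {
      // Illustrative values only; tune against your own workload.
      val config = ConfigFactory.parseString(
        """
        akka.cluster.sharding {
          # How often the coordinator considers rebalancing shards (default 10s)
          rebalance-interval = 10s

          # How long a shard gets to stop its entities in an orderly fashion before
          # the "less orderly" shutdown kicks in (default 60s -- the tunable
          # "tens of seconds" mentioned above)
          handoff-timeout = 60s

          # How many messages a shard region buffers for shards that are in flight
          # during a rebalance (default 100000)
          buffer-size = 100000

          # Cap how many shards move in one rebalance round (Akka 2.6.10+)
          least-shard-allocation-strategy {
            rebalance-absolute-limit = 5
            rebalance-relative-limit = 0.1
          }
        }
        """).withFallback(ConfigFactory.load())
    }
    ```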

    After the new node reports healthy, Kubernetes chooses a node of the previous version to be stopped. Ideally this node is able to leave the cluster gracefully: in that situation, it gives up responsibility for its shards and the shard coordinator allocates them to a node with the new app-version (this is in fact where app-version applies: it prevents the shards from starting up on a node that's destined to be stopped by Kubernetes later in the deployment). If the departure is ungraceful (rather than the node announcing to all the others that it's leaving, the other nodes decide that it has failed), it can take many tens of seconds for the split-brain-resolver on the other nodes to down it and for the shards to be recreated.
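    The app-version itself is plain Akka configuration, typically injected by the build or deployment pipeline, and the open-source split-brain-resolver covers the ungraceful case. A hedged sketch of both (key names are from the Akka reference.conf; the version string and timeout are placeholders):

    ```scala
    import com.typesafe.config.ConfigFactory

    object RollingUpdateVersioning {
      // "1.2.3" is a placeholder; in practice it is usually set per build/deployment.
      val config = ConfigFactory.parseString(
        """
        akka.cluster {
          # Newer nodes advertise a higher app-version, so the coordinator prefers
          # them when re-allocating shards given up by departing old-version nodes.
          app-version = "1.2.3"

          # Enable the open-source split-brain resolver for ungraceful departures.
          downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"

          split-brain-resolver {
            # Membership must be unchanged for this long before the SBR acts
            # (default 20s); together with failure detection this accounts for
            # most of the "many tens of seconds" before shards are recreated.
            stable-after = 20s
          }
        }
        """).withFallback(ConfigFactory.load())
    }
    ```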

    At some point during the process, the node hosting the shard coordinator will itself be stopped in the rolling upgrade, and after some delay the new oldest node will take over the cluster singletons (including the shard coordinator). It's best if that oldest node is the last node to be stopped: the responsibility then passes to the first node that was deployed with the new version. Kubernetes used to stop pods in roughly that order; however, fairly recent versions changed that heuristic, making it less likely, especially for longer-lived deployments. A recent release of Akka under the BSL (not open-source, but source-available) added support for telling Kubernetes to avoid stopping the node hosting the singletons for as long as it can: using that module in production requires a license from Lightbend (my employer).
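    The timing of that singleton hand-over is itself tunable in the open-source configuration, independently of the BSL module mentioned above. A minimal sketch, using keys from the stock reference.conf with their default values:

    ```scala
    import com.typesafe.config.ConfigFactory

    object SingletonHandoverTuning {
      val config = ConfigFactory.parseString(
        """
        akka.cluster.singleton {
          # The new oldest node retries taking over this often until the previous
          # owner confirms the hand-over (or has been removed from the cluster).
          hand-over-retry-interval = 1s
        }
        akka.cluster.singleton-proxy {
          # Messages sent through a ClusterSingletonProxy are buffered during the
          # hand-over, up to this many.
          buffer-size = 1000
        }
        """).withFallback(ConfigFactory.load())
    }
    ```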

    A latency spike during a shard rebalance is to be expected: it takes time for the node now hosting the shard to rebuild state (and keeping that state is the reason one uses cluster sharding). You can ameliorate the spike by (not all of these may apply to your situation, and this list is not exhaustive):

    • optimizing the time for your entities to stop
    • not using "remember entities" when there are more than a few tens of entities per shard: remember entities will start entity actors on the new node in an order that is unlikely to correlate with the order in which the entities are actually needed, and the remembered entities will compete for resources with the ones that are needed (see the config sketch after this list)
    • increasing maxSurge, though it should ideally remain less than akka.cluster.min-nr-of-members (to prevent the new nodes from forming a separate cluster): among other things, this may reduce the load on the first few new app-version nodes.
    • if using Akka Persistence, it's possible that the "max concurrent recoveries" setting is limiting the rate at which your entity actors can start handling traffic: this setting exists to prevent recovery from overloading your database, so be careful when raising it.
    • if event sourcing, particularly with very long-lived entities, more aggressive snapshotting might help (though in my experience with most backends, even thousands of events between snapshots often doesn't have that bad an impact; the size of your events relative to the size of the state will be relevant here).
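    To make the last few points concrete, here is a hedged sketch of where those knobs live, assuming Akka Persistence Typed for the snapshotting part; the values are illustrative only:

    ```scala
    import akka.persistence.typed.scaladsl.RetentionCriteria
    import com.typesafe.config.ConfigFactory

    object RebalanceLatencyTuning {
      val config = ConfigFactory.parseString(
        """
        # Avoid eagerly restarting large numbers of remembered entities on the new node.
        akka.cluster.sharding.remember-entities = off

        # Keep maxSurge below this, so surged new-version pods cannot form their own cluster.
        akka.cluster.min-nr-of-members = 3

        # Caps simultaneous persistent-actor recoveries to protect the database
        # (default 35); raising it can shorten the post-rebalance spike if the
        # database has headroom.
        akka.persistence.max-concurrent-recoveries = 35
        """).withFallback(ConfigFactory.load())

      // More aggressive snapshotting for long-lived event-sourced entities, so a
      // recovery after a shard move replays fewer events (pass this to
      // EventSourcedBehavior.withRetention).
      val retention = RetentionCriteria.snapshotEvery(numberOfEvents = 100, keepNSnapshots = 2)
    }
    ```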