Tags: akka, actor, akka-cluster, akka-remoting

Actor Cluster Sharding: remembered entities not properly recreated during rolling restart of nodes


Problem

In a 3-node Actor cluster, remembered entities are not recreated properly while doing a rolling restart of the 3 nodes.
The shards are completely rebalanced, but some of the entities are not recreated.

Cluster Configurations

akka.cluster.sharding.remember-entities = on
akka.cluster.sharding.remember-entities-store = ddata
akka.cluster.sharding.distributed-data.durable.keys = []

akka.remote.artery {
  enabled = on
  transport = tcp
}
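
For context, sharding is started with remember-entities via the classic Java API roughly as follows (a minimal sketch; "Counter", CounterActor and EntityEnvelope are placeholders, not our actual types):

    // Needs: akka.actor.{ActorRef, Props}, akka.cluster.sharding.*
    ClusterShardingSettings settings =
        ClusterShardingSettings.create(system).withRememberEntities(true);

    ActorRef region =
        ClusterSharding.get(system)
            .start(
                "Counter",
                Props.create(CounterActor.class), // placeholder entity actor
                settings,
                new ShardRegion.HashCodeMessageExtractor(300) { // 300 shards in total
                  @Override
                  public String entityId(Object message) {
                    return ((EntityEnvelope) message).entityId; // placeholder envelope
                  }
                });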

At the start, each of the 3 nodes hosts 100 shards and 1000 actors, for a total of 300 shards and 3000 actors.

  • Node 1 -- 100 Shards / 1000 Actors

  • Node 2 -- 100 Shards / 1000 Actors

  • Node 3 -- 100 Shards / 1000 Actors
    

1. When Node 1 goes down, the shards on Node 1 are rebalanced to Node 2 and Node 3, and all of their remembered entities are recreated on those nodes.

  • Node 1 -- Down

  • Node 2 -- 150 Shards / 1500 Actors

  • Node 3 -- 150 Shards / 1500 Actors
    

2. Node 1 comes back up, and a few moments later Node 2 goes down. The shards and remembered entities from Node 2 are recreated on Node 1.

  • Node 1 -- 150 Shards / 1500 Actors

  • Node 2 -- Down

  • Node 3 -- 150 Shards / 1500 Actors
    

3. Node 2 comes back up, and a few moments later Node 3 goes down. The shards and remembered entities from Node 3 are recreated on Node 2, but some of the entities are not recreated. All the shards are rebalanced regardless.

  • Node 1 -- 150 Shards / 1500 Actors

  • Node 2 -- 150 Shards / 1423 Actors

  • Node 3 -- Down
    

The issue here is

When we restart Node 3 right after Node 2 has rejoined the cluster, the recreation of remembered entities is inconsistent. Meanwhile, messages are still being sent to the actors in the cluster.

What can be the bottleneck when Node 3 is restarted right after Node 2 joins?

Tried

1. If we do not restart Node 3, there is no issue with the entities.
2. If we restart Node 3 on its own some time after the rolling restart, there is no problem.
3. Increased/decreased the shard count.
4. Changed akka.cluster.distributed-data.majority-min-cap from the default 5 to 3; the issue still persists.

Are there any configurations that need to be tuned?

In which part should we debug further to find the root cause?


Solution

  • Answering my own question.

    While debugging further, we found that the issue lies in the replication of the remembered entities across the nodes in between the frequent restarts.

    Debugging

    • In the Actor cluster, a Replicator actor is created on each node for every role.
    • The remembered entities are stored using a key-value approach.
      Within a shard there are five keys, and the entities are stored in an ORSet data structure.
    • These key-value pairs are replicated across the nodes using the gossip protocol.

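    To see why this storage converges across nodes, here is a minimal sketch of ORSet merge semantics using the Akka 2.6 Java API (the entity ids are made up):

    // Needs: akka.cluster.ddata.{DistributedData, ORSet, SelfUniqueAddress}
    SelfUniqueAddress node = DistributedData.get(system).selfUniqueAddress();

    // Two divergent local views of the same key...
    ORSet<String> a = ORSet.<String>create().add(node, "entity-1");
    ORSet<String> b = ORSet.<String>create().add(node, "entity-2");

    // ...merge into their union once gossip exchanges them.
    ORSet<String> merged = a.merge(b);
    merged.getElements(); // ["entity-1", "entity-2"]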

    To retrieve these key-value pairs, we can send a Get message to the Replicator actor, which lets us fetch the values of the desired keys stored in the cluster.

    // Needs: akka.cluster.ddata.{Key, ORSet, ORSetKey, Replicator},
    // akka.pattern.Patterns, java.util.concurrent.{CompletableFuture, TimeUnit};
    // orSetKey, replicator and timeout1 are defined elsewhere.
    Key<ORSet<String>> rememberEntitiesKey = ORSetKey.create(orSetKey);
    Replicator.Get<ORSet<String>> getCmd =
        new Replicator.Get<>(rememberEntitiesKey, Replicator.readLocal());

    CompletableFuture<Object> ack =
        Patterns.ask(replicator, getCmd, timeout1).toCompletableFuture();
    Object result = ack.get(5000, TimeUnit.MILLISECONDS);

    Replicator.GetSuccess<ORSet<String>> orSet =
        (Replicator.GetSuccess<ORSet<String>>) result;
    // The value contains the entities in the local ddata:
    orSet.dataValue().getElements();
    

    To find out the current state of remembered entities on each node, we can inspect that node's local data, which shows how the remembered entities are currently stored.
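
    As a diagnostic, one can sum a shard's remembered entities across its keys on a given node. Below is a sketch; the key format "shard-<typeName>-<shardId>-<n>" with five keys per shard is an internal detail of the ddata store (it may differ between Akka versions), and replicator, typeName and shardId are assumed to be in scope:

    // Needs additionally: java.time.Duration, java.util.{HashSet, Set}
    Set<String> localEntities = new HashSet<>();
    for (int i = 0; i < 5; i++) {
      // Assumed internal key format of the ddata remember-entities store.
      Key<ORSet<String>> key =
          ORSetKey.create("shard-" + typeName + "-" + shardId + "-" + i);
      Object reply =
          Patterns.ask(replicator, new Replicator.Get<>(key, Replicator.readLocal()),
                  Duration.ofSeconds(5))
              .toCompletableFuture()
              .get(5, TimeUnit.SECONDS);
      if (reply instanceof Replicator.GetSuccess) {
        localEntities.addAll(
            ((Replicator.GetSuccess<ORSet<String>>) reply).dataValue().getElements());
      }
      // A Replicator.NotFound reply just means no entities under that key.
    }
    // localEntities.size() is this node's local view for the shard.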

    For the scenario described in the question, the local replicator's remember-entities data showed the following:

    • Initially, all nodes in the cluster have 3000 remembered entities. When Node 1 goes down, Node 2 and Node 3 still have all 3000 remembered entities in memory.
    • When Node 1 comes back up, Node 2 goes down. Node 3 retains all 3000 remembered entities, and Node 1 replicates from Node 3 to restore its data.
    • When Node 2 comes back up, Node 3 goes down. However, Node 1 had not fully replicated from Node 3 beforehand, so it only holds around 2900+ entities in memory. Node 2 retrieves the entities it is missing from Node 1.
    • When Node 3 comes back up, it replicates from Node 1 and Node 2, so only the 2900+ entities are recreated.
    • As a result, the cluster ends up with only the 2900+ entities in its final state.

    Issue: under-replication in between the frequent restarts

    Due to the frequent restarts, the Replicator may fail to fully replicate the data before the next node goes down, resulting in incomplete or inconsistent data in the cluster. We resolved the issue by tuning the following properties:

    akka.cluster.sharding.distributed-data {
      gossip-interval = 500 ms               // default 2 s
      notify-subscribers-interval = 100 ms   // default 500 ms
    }
    

    After implementing these changes, even during frequent rolling restarts, the remembered-entities data is fully replicated to all nodes, and all the entities are recreated successfully. The trade-off of the shorter gossip interval is somewhat more replication traffic in exchange for faster convergence.
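
    To verify during a rolling restart that the data has actually spread beyond the local node, a key can also be read with majority consistency instead of readLocal. A sketch, reusing rememberEntitiesKey and replicator from the earlier snippet:

    // ReadMajority consults a majority of nodes and merges their replies;
    // a GetFailure reply means a majority could not be reached in time.
    Replicator.ReadMajority readMajority =
        new Replicator.ReadMajority(Duration.ofSeconds(3));

    Object reply =
        Patterns.ask(replicator, new Replicator.Get<>(rememberEntitiesKey, readMajority),
                Duration.ofSeconds(5))
            .toCompletableFuture()
            .get(5, TimeUnit.SECONDS);

    boolean replicated = reply instanceof Replicator.GetSuccess;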