Akka.net: Should I specify "split brain resolver" configuration for Lighthouse/Seed nodes

I have this application using Akka.net cluster feature. The people who wrote the code have left the company. I am trying to understand the code and we are planning a deployment.

The cluster has 2 types of nodes

QueueServicer: supports sharding and only these nodes should participate in sharding.
LightHouse: They are just seed nodes, nothing else.

Lighthouse : 2 nodes
QueueServicer : 3 Nodes

I see one of the QueueServicer node unable to join the cluster. Both lighthouse nodes are refusing connection. It constantly tries to join and never succeeds. This has been happening for the last 5 days or so and the node is never dying also. Its CPU and memory usage is high. Also It doesn't have any queue processor actors running when filtered search through the log. It takes long hours for Garbage collection etc. I see in the log for this node, the following.

{"timestamp":"2021-09-08T22:26:59.025Z", "logger":"Akka.Event.DummyClassForStringSources", "message":Tried to associate with unreachable remote address [akka.tcp://myapp@lighthouse-1:7892]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://myapp@lighthouse-1:7892] Caused by: [System.AggregateException: One or more errors occurred. (Connection refused akka.tcp://myapp@lighthouse-1:7892) ---> Akka.Remote.Transport.InvalidAssociationException: Connection refused akka.tcp://myapp@lighthouse-1:7892 at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress) at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress) --- End of inner exception stack trace --- at System.Threading.Tasks.Task1.GetResultCore(Boolean waitCompletionNotification) at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__12_18(Task1 result) at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke() at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)

{"timestamp":"2021-09-08T22:26:59.025Z", "logger":"Akka.Event.DummyClassForStringSources", "message":Tried to associate with unreachable remote address [akka.tcp://myapp@lighthouse-0:7892]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://myapp@lighthouse-0:7892] Caused by: [System.AggregateException: One or more errors occurred. (Connection refused akka.tcp://myapp@lighthouse-0:7892) ---> Akka.Remote.Transport.InvalidAssociationException: Connection refused akka.tcp://myapp@lighthouse-0:7892 at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress) at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress) --- End of inner exception stack trace --- at System.Threading.Tasks.Task1.GetResultCore(Boolean waitCompletionNotification) at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__12_18(Task1 result) at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke() at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)

There are other "Now supervising", "Stopping" "Started" logs which I am omitting here.

Can you please verify if the HCON config is correct for split brain resolver and Sharding?

I think LightHouse/SeeNodes should not have the sharding configuration specified. I think it is a mistake. I also think, split brain resolver configuration might be wrong in LightHouse/SeedNodes and should not be specified for seed nodes.

I appreciate your help.

Here is the HOCON for QueueServicer Trimmed

akka {
    loggers = ["Akka.Logger.log4net.Log4NetLogger, Akka.Logger.log4net"]
    log-config-on-start = on
    loglevel = "DEBUG"
    actor {
        provider = cluster
        serializers {
            hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
        }
        serialization-bindings {
            "System.Object" = hyperion
        }
    }

remote {
    dot-netty.tcp {
    ….
    }
}

cluster {
    seed-nodes = ["akka.tcp://myapp@lighthouse-0:7892",akka.tcp://myapp@lighthouse-1:7892"]
    roles = ["QueueProcessor"]
    sharding {
        role = "QueueProcessor"
        state-store-mode = ddata
        remember-entities = true
        passivate-idle-entity-after = off
    }

    downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
    split-brain-resolver {
                               active-strategy = keep-majority
                               stable-after = 20s
        keep-majority {
            role = "QueueProcessor"
        }
     }
    down-removal-margin = 20s
}

extensions = ["Akka.Cluster.Tools.PublishSubscribe.DistributedPubSubExtensionProvider,Akka.Cluster.Tools"]

}

Here is the HOCON for Lighthouse

remote {
    dot-netty.tcp {
    …
    }
}

cluster {
    seed-nodes = ["akka.tcp://myapp@lighthouse-0:7892",akka.tcp://myapp@lighthouse-1:7892"]
    roles = ["lighthouse"]
    sharding {
        role = "lighthouse"
        state-store-mode = ddata
        remember-entities = true
        passivate-idle-entity-after = off
    }

    downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
    split-brain-resolver {
                                  active-strategy = keep-oldest
                                  stable-after = 30s
              keep-oldest {
            down-if-alone = on
            role = "lighthouse"
              }
      }
 }

}

Solution

I meant to reply to this sooner.

Here is your problem: you're using two different split brain resolver configurations - one for the QueueServicer and one for Lighthouse. Therefore, how your cluster resolves itself is going to be quite different depending upon who is the leader of each half of the cluster.

I would stick with a simple keep-majority strategy and use it uniformly on all nodes throughout the cluster - we're very likely going to enable this by default in Akka.NET v1.5.

If you have any questions, please feel free to reach out to us: https://petabridge.com/