I have an application that uses the Akka.NET cluster feature. The people who wrote the code have left the company, so I am trying to understand it while we plan a deployment.
The cluster has two types of nodes:
QueueServicer: supports sharding; only these nodes should participate in sharding.
Lighthouse: pure seed nodes, nothing else.
Lighthouse: 2 nodes
QueueServicer: 3 nodes
One of the QueueServicer nodes is unable to join the cluster: both Lighthouse nodes refuse its connection, and it keeps retrying without ever succeeding. This has been happening for the last 5 days or so, yet the node never dies either. Its CPU and memory usage are high, garbage collection takes a long time, and a filtered search of the log shows no queue processor actors running on it. The node's log contains the following:
{"timestamp":"2021-09-08T22:26:59.025Z", "logger":"Akka.Event.DummyClassForStringSources", "message":Tried to associate with unreachable remote address [akka.tcp://myapp@lighthouse-1:7892]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://myapp@lighthouse-1:7892] Caused by: [System.AggregateException: One or more errors occurred. (Connection refused akka.tcp://myapp@lighthouse-1:7892) ---> Akka.Remote.Transport.InvalidAssociationException: Connection refused akka.tcp://myapp@lighthouse-1:7892 at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress) at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress) --- End of inner exception stack trace --- at System.Threading.Tasks.Task1.GetResultCore(Boolean waitCompletionNotification) at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__12_18(Task
1 result) at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke() at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
{"timestamp":"2021-09-08T22:26:59.025Z", "logger":"Akka.Event.DummyClassForStringSources", "message":Tried to associate with unreachable remote address [akka.tcp://myapp@lighthouse-0:7892]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://myapp@lighthouse-0:7892] Caused by: [System.AggregateException: One or more errors occurred. (Connection refused akka.tcp://myapp@lighthouse-0:7892) ---> Akka.Remote.Transport.InvalidAssociationException: Connection refused akka.tcp://myapp@lighthouse-0:7892 at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress) at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress) --- End of inner exception stack trace --- at System.Threading.Tasks.Task1.GetResultCore(Boolean waitCompletionNotification) at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__12_18(Task
1 result) at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke() at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
There are other "Now supervising", "Stopping", and "Started" logs, which I am omitting here.
Can you please verify whether the HOCON config is correct for the split brain resolver and sharding?
I think the Lighthouse/seed nodes should not have the sharding configuration specified; I believe that is a mistake. I also suspect the split brain resolver configuration in the Lighthouse/seed nodes is wrong and should not be specified for seed nodes at all.
I appreciate your help.
Here is the trimmed HOCON for the QueueServicer:
akka {
  loggers = ["Akka.Logger.log4net.Log4NetLogger, Akka.Logger.log4net"]
  log-config-on-start = on
  loglevel = "DEBUG"
  actor {
    provider = cluster
    serializers {
      hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
    }
    serialization-bindings {
      "System.Object" = hyperion
    }
  }
  remote {
    dot-netty.tcp {
      ….
    }
  }
  cluster {
    seed-nodes = ["akka.tcp://myapp@lighthouse-0:7892", "akka.tcp://myapp@lighthouse-1:7892"]
roles = ["QueueProcessor"]
sharding {
role = "QueueProcessor"
state-store-mode = ddata
remember-entities = true
passivate-idle-entity-after = off
}
downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
split-brain-resolver {
active-strategy = keep-majority
stable-after = 20s
keep-majority {
role = "QueueProcessor"
}
}
down-removal-margin = 20s
}
extensions = ["Akka.Cluster.Tools.PublishSubscribe.DistributedPubSubExtensionProvider,Akka.Cluster.Tools"]
}
Here is the HOCON for Lighthouse:
akka {
  loggers = ["Akka.Logger.log4net.Log4NetLogger, Akka.Logger.log4net"]
  log-config-on-start = on
  loglevel = "DEBUG"
  actor {
    provider = cluster
    serializers {
      hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
    }
    serialization-bindings {
      "System.Object" = hyperion
    }
  }
  remote {
    dot-netty.tcp {
      …
    }
  }
  cluster {
    seed-nodes = ["akka.tcp://myapp@lighthouse-0:7892", "akka.tcp://myapp@lighthouse-1:7892"]
roles = ["lighthouse"]
sharding {
role = "lighthouse"
state-store-mode = ddata
remember-entities = true
passivate-idle-entity-after = off
}
downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
split-brain-resolver {
active-strategy = keep-oldest
stable-after = 30s
keep-oldest {
down-if-alone = on
role = "lighthouse"
}
}
}
}
I meant to reply to this sooner.
Here is your problem: you're using two different split brain resolver configurations - keep-majority on the QueueServicer nodes and keep-oldest on the Lighthouse nodes. How your cluster resolves a partition will therefore differ depending upon who is the leader of each half of the cluster, and the two halves can reach conflicting decisions about which side should survive.
I would stick with a simple keep-majority strategy and use it uniformly on all nodes throughout the cluster - we're very likely going to enable this by default in Akka.NET v1.5.
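For illustration, here is a minimal sketch of what that uniform block might look like - assuming you keep the downing provider class and the 20s timing values you already have (tune those to your environment), and treating the role scoping as an open design choice rather than a requirement. The key point is that every node, Lighthouse included, gets the identical block:

akka.cluster {
  downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-majority
    # How long membership must be stable before the strategy acts;
    # carried over from your QueueServicer config, not a recommendation.
    stable-after = 20s
    keep-majority {
      # Optional: count only QueueProcessor nodes toward the majority so the
      # Lighthouse seed nodes don't affect the vote. Whether to scope by role
      # is your call - just use the same value on every node.
      role = "QueueProcessor"
    }
  }
  down-removal-margin = 20s
}

With the same strategy everywhere, both sides of a partition apply the same rule and arrive at complementary survival decisions, which is the whole point of a split brain resolver.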
If you have any questions, please feel free to reach out to us: https://petabridge.com/