
Akka.NET cluster intermittent dead letters


We have our cluster running locally (for now) and everything seems to be configured correctly. Our prime calculation messages are distributed over our seed nodes. However, we are intermittently losing messages: the screenshot below shows the behaviour of two runs, and which messages end up as dead letters is not consistent at all.

Our messages are always sent the same way and look like this; the last parameter is the nth prime to find.

var entries = new[]
{
    new PrimeCalculationEntry(id, 1, 100000),
    new PrimeCalculationEntry(id, 2, 150000),
    new PrimeCalculationEntry(id, 3, 200000),
    new PrimeCalculationEntry(id, 4, 250000),
    new PrimeCalculationEntry(id, 5, 300000),
    new PrimeCalculationEntry(id, 6, 350000),
    new PrimeCalculationEntry(id, 7, 400000),
    new PrimeCalculationEntry(id, 8, 450000)
};
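
These entries are then sent to the router; a minimal sketch, assuming commander holds the IActorRef of the /commander group router shown below (the variable names are our assumptions):

foreach (var entry in entries)
{
    // Tell is fire-and-forget; the round-robin group picks the next routee
    commander.Tell(entry);
}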

[Screenshot: dead-letter behaviour across two runs]

Our cluster is set up like this: one non-seed node hosting a group router, which sends messages to two seed nodes that are each configured with a pool router.

Non-seed node: localhost:0 (random port)

akka {
    actor {
        provider = cluster
        deployment {
            /commander {
                router = round-robin-group # routing strategy
                routees.paths = ["/user/cluster"] # path of the routee on each node
                cluster {
                    enabled = on
                    allow-local-routees = on
                }
            }
        }
    }
    remote {
        dot-netty.tcp {
            port = 0 # let the OS pick a random port
            hostname = localhost
        }
    }
    cluster {
        seed-nodes = ["akka.tcp://ClusterSystem@localhost:8081", "akka.tcp://ClusterSystem@localhost:8082"]
    }
}
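
For reference, a minimal sketch of how this node could be started and the /commander router created from the deployment section above (the file name and program structure are our assumptions):

using System.IO;
using Akka.Actor;
using Akka.Configuration;
using Akka.Routing;

// Parse the HOCON above, here assumed to live in a local file
var config = ConfigurationFactory.ParseString(File.ReadAllText("nonseed.hocon"));
var system = ActorSystem.Create("ClusterSystem", config);

// A group router manages no children of its own, so Props.Empty is enough;
// FromConfig.Instance pulls the /commander settings from the deployment config
var commander = system.ActorOf(Props.Empty.WithRouter(FromConfig.Instance), "commander");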

Seed node 1: localhost:8081 (leader)

akka {
    actor {
        provider = cluster
        deployment {
            /cluster {
                router = round-robin-pool
                nr-of-instances = 10
            }
        }
    }
    remote {
        dot-netty.tcp {
            port = 8081
            hostname = localhost
        }
    }
    cluster {
        seed-nodes = ["akka.tcp://ClusterSystem@localhost:8081"]
    }
}

Seed node 2: localhost:8082

akka {
    actor {
        provider = cluster
        deployment {
            /cluster {
                router = round-robin-pool
                nr-of-instances = 10
            }
        }
    }
    remote {
        dot-netty.tcp {
            port = 8082
            hostname = localhost
        }
    }
    cluster {
        seed-nodes = ["akka.tcp://ClusterSystem@localhost:8081"]
    }
}
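
On each seed node, the /user/cluster pool router referenced by routees.paths would be created along these lines; PrimeCalculatorActor is a hypothetical worker type that handles PrimeCalculationEntry messages:

using System.IO;
using Akka.Actor;
using Akka.Configuration;
using Akka.Routing;

var config = ConfigurationFactory.ParseString(File.ReadAllText("seed1.hocon"));
var system = ActorSystem.Create("ClusterSystem", config);

// The pool router spawns nr-of-instances routees of the worker actor;
// the name "cluster" must match routees.paths = ["/user/cluster"] on the commander
var workers = system.ActorOf(
    Props.Create<PrimeCalculatorActor>().WithRouter(FromConfig.Instance),
    "cluster");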

Can anyone point us in the right direction? Any issues with our configuration? Thank you in advance.


Solution

  • I think I know what the issue is here: you don't have any akka.cluster.roles defined, nor is your /commander router configured with the use-role setting. Because allow-local-routees = on, the group router also counts the /commander node itself as a routee target, so every Nth message is routed to a node that has no /user/cluster actor to receive it and ends up in dead letters.

    To fix this properly, we should do the following:

    1. Have all nodes that can process PrimeCalculationEntry messages declare akka.cluster.roles = [prime] (a HOCON sketch follows below).
    2. Have the node with the /commander router change its HOCON to:
        /commander {
            router = round-robin-group # routing strategy
            routees.paths = ["/user/cluster"] # path of the routee on each node
            cluster {
                enabled = on
                allow-local-routees = on
                use-role = "prime"
            }
        }
    

    This will eliminate the dead letters, as the /commander node will no longer send messages to itself every N iterations.
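
    For step 1, a sketch of the corresponding change on each seed node (only the cluster section changes; everything else stays as before):

        cluster {
            roles = ["prime"]
            seed-nodes = ["akka.tcp://ClusterSystem@localhost:8081"]
        }

    Alternatively, setting allow-local-routees = off on the /commander router should also stop messages from being routed to the local node; the role-based approach just scales better once the cluster gains more node types.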