Search code examples
f#akka.netakka.net-clusterakka.fsharp

How do I implement Failover within an Akka.NET cluster using the Akka.FSharp API?


How do I implement Failover within an Akka.NET cluster using the Akka.FSharp API?

I have the following cluster node that serves as a seed:

open Akka
open Akka.FSharp
open Akka.Cluster
open System
open System.Configuration

let systemName = "script-cluster"
let nodeName = sprintf "cluster-node-%s" Environment.MachineName
let akkaConfig = Configuration.parse("""akka {  
                                          actor {
                                            provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
                                          }
                                          remote {
                                            log-remote-lifecycle-events = off
                                            helios.tcp {
                                                hostname = "127.0.0.1"
                                                port = 2551       
                                            }
                                          }
                                          cluster {
                                            roles = ["seed"]  # custom node roles
                                            seed-nodes = ["akka.tcp://script-cluster@127.0.0.1:2551"]
                                            # when node cannot be reached within 10 sec, mark is as down
                                            auto-down-unreachable-after = 10s
                                          }
                                        }""")
let actorSystem = akkaConfig |> System.create systemName

let clusterHostActor =
    spawn actorSystem nodeName (fun (inbox: Actor<ClusterEvent.IClusterDomainEvent>) -> 
        let cluster = Cluster.Get actorSystem
        cluster.Subscribe(inbox.Self, [| typeof<ClusterEvent.IClusterDomainEvent> |])
        inbox.Defer(fun () -> cluster.Unsubscribe(inbox.Self))
        let rec messageLoop () = 
            actor {
                let! message = inbox.Receive()                        
                // TODO: Handle messages
                match message with
                | :? ClusterEvent.MemberJoined as event -> printfn "Member %s Joined the Cluster at %O" event.Member.Address.Host DateTime.Now
                | :? ClusterEvent.MemberLeft as event -> printfn "Member %s Left the Cluster at %O" event.Member.Address.Host DateTime.Now
                | other -> printfn "Cluster Received event %O at %O" other DateTime.Now

                return! messageLoop()
            }
        messageLoop())

I then have an arbitrary node that could die:

open Akka
open Akka.FSharp
open Akka.Cluster
open System
open System.Configuration

let systemName = "script-cluster"
let nodeName = sprintf "cluster-node-%s" Environment.MachineName
let akkaConfig = Configuration.parse("""akka {  
                                          actor {
                                            provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
                                          }
                                          remote {
                                            log-remote-lifecycle-events = off
                                            helios.tcp {
                                                hostname = "127.0.0.1"
                                                port = 0       
                                            }
                                          }
                                          cluster {
                                            roles = ["role-a"]  # custom node roles
                                            seed-nodes = ["akka.tcp://script-cluster@127.0.0.1:2551"]
                                            # when node cannot be reached within 10 sec, mark is as down
                                            auto-down-unreachable-after = 10s
                                          }
                                        }""")
let actorSystem = akkaConfig |> System.create systemName

let listenerRef =  
    spawn actorSystem "temp2"
    <| fun mailbox ->
        let cluster = Cluster.Get (mailbox.Context.System)
        cluster.Subscribe (mailbox.Self, [| typeof<ClusterEvent.IMemberEvent>|])
        mailbox.Defer <| fun () -> cluster.Unsubscribe (mailbox.Self)
        printfn "Created an actor on node [%A] with roles [%s]" cluster.SelfAddress (String.Join(",", cluster.SelfRoles))
        let rec seed () = 
            actor {
                let! (msg: obj) = mailbox.Receive ()
                match msg with
                | :? ClusterEvent.MemberRemoved as actor -> printfn "Actor removed %A" msg
                | :? ClusterEvent.IMemberEvent           -> printfn "Cluster event %A" msg
                | _ -> printfn "Received: %A" msg
                return! seed () }
        seed ()

What is the recommended practice for implementing failover within a cluster?

Specifically, is there a code example of how a cluster should behave when one of its nodes is no longer available?

  • Should my cluster node spin-up a replacement or is there a different behavior?
  • Is there a configuration that automatically handles this that I can just set without having to write code?
  • What code would I have to implement and where?

Solution

  • First of all it's a better idea to rely on MemberUp and MemberRemoved events (both implementing ClusterEvent.IMemberEvent interface, so subscribe for it), as they mark phases, when node joining/leaving procedure has been completed. Joined and left events doesn't necessarily ensure that node is fully operable at signaled point in time.

    Regarding failover scenario:

    • Automatic spinning of the replacements can be done via Akka.Cluster.Sharding plugin (read articles 1 and 2 to get more info about how does it work). There's no equivalent in Akka.FSharp for it, but you may use Akkling.Cluster.Sharding plugin instead: see example code.
    • Another way is to create replacement actors up-front on each of the nodes. You can route messages to them by using clustered routers or distributed publish/subscribe. This is however more a case in situation, when you have stateless scenarios, so that each actor is perfectly able to pick up work of another actor at any time. This is more general solution for distributing work among many actors living on many different nodes.
    • You may also set watchers over processing actors. By using monitor function, you may order your actor to watch over another actor (no matter where does it live). In case of node failure, info about dying actor will be send in form of Terminated message to all of its watchers. This way you may implement your own logic i.e. recreating actor on another node. This is actually the most generic way, as it doesn't use any extra plugins or configuration, but the behavior needs to be described by yourself.