I have an abstract question.
I need a service with fault tolerance. The service only can only be running on one node at a time. This is the key.
With two connected nodes: A and B.
I have think about heartbeat protocol for sync the status of the nodes and detect timeouts, however there are a lot of race conditions.
I can add a third node with a global lock, but I'm not sure about how to do this.
Anybody know any well-known algorithm to do this? Or better Is there any open source software that lets me control this kind of things?
Thanks
If you can provide some sort of a shared memory between nodes, then there is the classical algorithm that resolves this problem, called Peterson's algorithm.
It is based on two additional variables, called flag
and turn
. Turn
is an integer variable whose value represents an index of node that is allowed to be active at the moment. In other words, turn=1
indicates that node no 1 has right to be active, and other node should wait. In other words, it is his turn to be active - that's where the name comes from.
Flag
is a boolean array where flag[i]
indicates that i-th node declares itself as ready for service. In your setup, flag[i]=false
means that i-th node is down. Key part of the algorithm is that a node which is ready for service (i.e. flag[i] = true
) has to wait until he obtains turn
.
Algorithm is originally developed for resolving a problem of execution a critical section without conflict. However, in your case a critical section is simply running the service. You just have to ensure that before i-th node is turned off, it sets flag[i]
to false. This is definitely a tricky part because if a node crashes, it obviously cannot set any value. I would go here with a some sort of a heartbeat.
Regarding the open source software that resolves similar problems, try searching for "cluster failover". Read about Google's Paxos and Google FileSystem. There are plenty of solutions, but if you want to implement something by yourself, I would try Peterson's algorithm.