Quoting the Erlang documentation
As of OTP 25, global will by default prevent overlapping partitions due to network issues by actively disconnecting from nodes that reports that they have lost connections to other nodes. This will cause fully connected partitions to form instead of leaving the network in a state with overlapping partitions.
Now, I ran a set of experiments where 3 nodes A, B, C, forming a fully connected network (so for instance a nodes()
on A would evaluate to [B,C]
), experience a failure of the link between B and C.
Then, I observed that in some cases the resulting fully connected network was formed by A and B and in other cases it was formed by A and C.
Graphically:
Fully connected After fault Scenario 1 Scenario 2
A A A A
/ \ / \ / \
/ \ / \ / \
/ \ / \ / \
B---------C B C B C B C
I could not find a specification of the algorithm, so the question is: could you provide me with one, more or less formal? Or if it already exists in the documentation, could you point it out to me?
Thanks in advance.
After quite a bit of digging in the source code I was able to find an informal explanation of how the algorithm works. This excerpt is taken from here
%% ----------------------------------------------------------------
%% Prevent Overlapping Partitions Algorithm
%% ========================================
%%
%% 1. When a node lose connection to another node it sends a
%% {lost_connection, LostConnNode, OtherNode} message to all
%% other nodes that it knows of.
%% 2. When a lost_connection message is received the receiver
%% first checks if it has seen this message before. If so, it
%% just ignores it. If it has not seen it before, it sends the
%% message to all nodes it knows of. This in order to ensure
%% that all connected nodes will receive this message. It then
%% sends a {remove_connection, LostConnRecvNode} message (where
%% LostConnRecvNode is its own node name) to OtherNode and
%% clear all information about OtherNode so OtherNode wont be
%% part of ReceiverNode's cluster anymore. When this information
%% has been cleared, no lost_connection will be triggered when
%% a nodedown message for the connection to OtherNode is
%% received.
%% 3. When a {remove_connection, LostConnRecvNode} message is
%% received, the receiver node takes down the connection to
%% LostConnRecvNode and clears its information about
%% LostConnRecvNode so it is not part of its cluster anymore.
%% Both nodes will receive a nodedown message due to the
%% connection being closed, but none of them will send
%% lost_connection messages since they have cleared information
%% about the other node.