I am implementing Paxos in a cluster simulator application, using the documentation available in Wikipedia. Unfortunately, it leaves several doors open to interpretation and does not provide much information on key implementation issues. It is unclear and incomplete.
Isn't Paxos going to enter an infinite loop? I guess one should not initiate Paxos if one cannot communicate with at least a quorum of nodes.
What is 'the last value it accepted'? Is it any previous proposal number from the proposer? What does 'instance' refer to exactly in this case?
In Phase 1a: Does one include the value to agree on with the Prepare message or is this deferred to the Accept! message? Or it does matter?
In Phase 2a: 'If any of the Acceptors have already accepted a value, the leader must Choose a value with the maximum proposal number N'.
What is value here? Is it the proposal number? I believe not, but this phrase is unclear.
In Phase 2a: 'Otherwise, the Proposer is free to choose any value'. What does this mean? A value for what? For the proposal number?
Paxos seems to rely on an increasing value of N (proposal number) to work? Is this correct?
The wikipedia entry does not discuss the initial values a node should set before starting to participate in Paxos. What are these?
P.S.: I don't have enough reputation to create a 'Paxos' tag (any volunteer?)
What is an instance?
The nomenclature in Paxos is a little unintuitive.
Assuming a cluster divided in 3 regions, each containing 3 nodes (total = 9 nodes). What happens if communication is broken between regions? There is no way any leader can reach quorum (which is 5).
Paxos requires you can get at least a quorum (5 nodes in your case). Go with your solution of three regions; having two network partitions between the three regions is very bad news. I also use a version of Paxos which can change node membership from one instance to the next. This is useful for partitions and node failure.
Isn't Paxos going to enter an infinite loop?
A naive implementation of Paxos is not guaranteed to terminate because multiple nodes can leap-frog Prepare phases. There are two ways of getting around this. One is to have a random backoff before starting new Prepare phases. The second is to route all requests to a designated leader, that acts as proposer (The leader is chosen by a Paxos instance. See also Multi-paxos)
In Phase 1b: 'If the proposal number N is larger than any previous proposal, then each >>Acceptor promises not to accept proposals less than N, and sends the value it last accepted for >>this instance to the Proposer'.
What is 'the last value it accepted'? Is it any previous proposal number from the proposer?
When a node receives an Accept!(roundId, value) message from a Proposer and it hasn't promised to not accept the value (due to a Prepare!(higherRoundId) message), it stores the value and the roundId (I'll call them acceptedValue
and acceptedRoundId
). It may write over these due to subsequent Accept!(...) messages.
When a node receives a Prepare!(roundId) message from a Proposer, it stores roundId as promiseRoundId = max(roundId, promiseRoundId)
. It then sends a Promise!(acceptedRoundId, acceptedValue)
back to the Proposer. NB: if a node hasn't received an Accept!(...) message, it replies with Promise!(null, null)
.
In Phase 1a: Does one include the value to agree on with the Prepare message or is this deferred to the Accept! message? Or it does matter?
There is no need to send it. I don't.
In Phase 2a: 'If any of the Acceptors have already accepted a value, the leader must Choose a value with the maximum proposal number N'.
What is value here? Is it the proposal number? I believe not, but this phrase is unclear.
The value is the actual data the algorithm is reaching consensus on. I'll rephrase this to
To start the Accept Phase, The Proposer must choose a value to be accepted depending on the results of the Prepare phase. If any Acceptor replied with Promise(roundId, value), the Proposer must use the value associated with the highest roundId. Otherwise, the Proposer received only Promise(null, null), and may choose any value to send to the acceptors.
NB: Proposal number here is the same thing as roundId.
In Phase 2a: 'Otherwise, the Proposer is free to choose any value'. What does this mean? A value for what? For the proposal number?
This is the value you want to have consensus on. This is typically a state change across the distributed system, perhaps triggered by a client request.
Paxos seems to rely on an increasing value of N (proposal number) to work? Is this correct?
The wikipedia entry does not discuss the initial values a node should set before starting to participate in Paxos. What are these?
Round ids (aka proposal numbers) should be increasing and must be unique per instance across all nodes. The Paxos paper assumes you can do this because it is trivial to achieve. Here's one scheme that produces the same results on all nodes:
roundId = i*M + index[node]
where i is the ith round this node is starting (that is i is unique per node per paxos instance, and is monotonically increasing).Or in pseudo-code (which is clearly lacking a few major optimizations):
define runPaxos( allNodesThisPaxosInstance, myValue ) {
allNodesThisPaxosInstance.sort()
offset = allNodesThisPaxosInstance.indexOf( thisNode )
for (i = 0; true; i++) {
roundId = offset + i * allNodesThisPaxosInstance.size()
prepareResult = doPreparePhase( roundId )
if (!prepareResult.shouldContinue?)
return
if (prepareResult.hasAnyValue?)
chosenValue = prepareResult.valueWithHighestRoundId
else
chosenValue = myValue
acceptResult = doAcceptPhase( roundId, chosenValue )
if (!acceptResult.shouldContinue?)
return
}
}