Search code examples
cluster-computingdistributedleader-election

Why is it recommended to create clusters with odd number of nodes


There are several resources about distributed systems, like the mongo db documentation that recommend odd number of nodes in a cluster.

What are the benefits of having odd number of nodes?


Solution

  • Short answer: in this case of MongoDB, having an odd number of nodes increases your clustered system's availability (uptime).

    Look at the table in the MongoDB documentation you linked:

    +-------------------+------------------------------------------+-----------------+
    | Number of Members | Majority Required to Elect a New Primary | Fault Tolerance |
    +-------------------+------------------------------------------+-----------------+
    |         3         |                    2                     |        1        |
    +-------------------+------------------------------------------+-----------------+
    |         4         |                    3                     |        1        |
    +-------------------+------------------------------------------+-----------------+
    |         5         |                    3                     |        2        |
    +-------------------+------------------------------------------+-----------------+
    |         6         |                    4                     |        2        |
    +-------------------+------------------------------------------+-----------------+
    

    Notice that how when you have an odd number of members and add one more (to become even), your fault tolerance does not go up! (Meaning, your cluster cannot tolerate more failed members than it originally could)

    This is because MongoDB requires a majority of members to be up in order to elect a primary. This property is not specific to MongoDB, but any clustered system that requires a majority of members to be up (for example, see also etcd).

    Your system availability actually goes down when increasing to an even number of nodes because, although your fault tolerance remains the same, there are more nodes that can fail so the probability of a fault occurring goes up.

    In addition, having an even number of members decreases the probability that if there is a network partition then some subset of your nodes will be able to continue running. For example, if you've got a 6 node cluster then it opens up the possibility that a network partition could partition your nodes into 2 3-node partitions. In such a case then neither partition will be able to communicate with a majority of members and your cluster becomes unavailable.

    The counter-intuitive conclusion is that, if you have an even-membered cluster then it is actually beneficial (from a high-availability standpoint) to remove one of the members.