elasticsearch opensearch splitbrain

What will happen if I lose the network connection between 2 zones?


There are 3 bare-metal servers in 3 different zones. The zones are connected to each other with dark fiber cables.

Let's say network-3 is down temporarily while network-1 and network-2 are healthy. So zone-3 can communicate with both zone-1 and zone-2, but zone-1 and zone-2 cannot communicate with each other.

The application sends indexing/search requests to both zone-1 and zone-2.

  1. What will happen in that scenario?
  2. Should I worry about split brain?



Solution

  • I can only answer for Elasticsearch here. In the case of Elasticsearch:

    1. What will happen in that scenario?

    The tiebreaker node will vote for one of the nodes, either the one in zone-1 or the one in zone-2, and with two votes (tiebreaker + self) that node will be elected master. Assuming the node in zone-1 is elected, the master-eligible node in zone-2 will keep trying to obtain a quorum without success, and until it obtains one it cannot become master. Meanwhile, the nodes in zone-2 cannot connect to the master in zone-1 and will therefore go into a mode where no master is available. What they can do in that mode depends on the current value of the cluster.no_master_block setting.

    • If it is set to write (the default), only search operations are permitted.
    • If it is set to all, both indexing and search operations fail.
    • If it is set to metadata_write, searching works, but indexing may or may not work depending on the primary shard allocation at the moment of failure. See the "Reading and Writing documents" section of the documentation for a more detailed analysis.
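The three values can be summarized as a small lookup table. This is only a simplified sketch of the behavior described above for a node that cannot see a master. The NO_MASTER_BLOCK table and the outcome helper are hypothetical names for illustration, not part of any Elasticsearch client, and the metadata_write row for metadata changes is an assumption based on such changes needing an elected master.

```python
# Simplified model of a node that currently sees no elected master.
# NO_MASTER_BLOCK and outcome() are illustrative names, not an Elasticsearch API.
NO_MASTER_BLOCK = {
    # setting value -> what happens to each kind of request
    "all": {
        "search": "rejected",
        "index": "rejected",
        "metadata_write": "rejected",
    },
    "write": {  # the default
        "search": "allowed",  # served from the last known cluster state
        "index": "rejected",
        "metadata_write": "rejected",
    },
    "metadata_write": {
        "search": "allowed",
        # Regular indexing is not blocked, but whether it succeeds depends
        # on where the primary shards were allocated when the link failed.
        "index": "depends on primary shard allocation",
        # Assumption: metadata changes (e.g. mapping updates) need a master.
        "metadata_write": "rejected",
    },
}


def outcome(block_setting: str, operation: str) -> str:
    """Return the modeled outcome of an operation on a master-less node."""
    return NO_MASTER_BLOCK[block_setting][operation]


if __name__ == "__main__":
    for setting, operations in NO_MASTER_BLOCK.items():
        for operation in operations:
            print(f"{setting:15} {operation:15} {outcome(setting, operation)}")
```

In the scenario from the question, this table applies to the nodes in whichever zone lost the election, since only those nodes are left without a master.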
    2. Should I worry about split brain?

    No. This is a somewhat common setup for small clusters. That said, depending on your configuration, how your clients are routed to the nodes in the cluster, the size of the cluster, and the load, you might experience some degradation of service. See some additional recommendations for setting up such clusters in the "Resilience in small clusters" section of the documentation.
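To see why a split brain cannot occur here, the vote counting can be sketched as a toy simulation. This is a deliberately simplified model, assuming three voting nodes (zone-1, zone-2 and a tiebreaker in zone-3) and one vote per node per election round. Real Elasticsearch elections involve terms and pre-voting that are not modeled, and all names below (VOTERS, LINKS, run_election) are hypothetical.

```python
# Toy model of a majority election during the partition from the question.
# All names are illustrative; this is not how Elasticsearch is implemented.
VOTERS = {"zone-1", "zone-2", "tiebreaker"}

# Healthy links: zone-3 (tiebreaker) reaches both zones,
# but zone-1 and zone-2 cannot reach each other.
LINKS = {
    frozenset({"zone-1", "tiebreaker"}),
    frozenset({"zone-2", "tiebreaker"}),
}


def reachable(a, b, links):
    """A node always reaches itself; otherwise a healthy link is required."""
    return a == b or frozenset({a, b}) in links


def run_election(candidates, voters=VOTERS, links=LINKS):
    """Candidates ask for votes in the given order; each voter votes once."""
    quorum = len(voters) // 2 + 1  # strict majority: 2 of 3
    granted = {}  # voter -> candidate that received its single vote
    for candidate in candidates:
        for voter in sorted(voters):
            if voter not in granted and reachable(candidate, voter, links):
                granted[voter] = candidate
    tally = {c: sum(1 for v in granted.values() if v == c) for c in candidates}
    winners = [c for c in candidates if tally[c] >= quorum]
    return tally, winners


if __name__ == "__main__":
    print(run_election(["zone-1", "zone-2"]))
    print(run_election(["zone-2", "zone-1"]))
```

Whichever zone reaches the tiebreaker first collects two votes and wins. The other zone is left with only its own vote, which is below the quorum of two, so two masters can never be elected at the same time.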