elasticsearch, kubernetes, kubernetes-operator

Elasticsearch 7.2.0: master not discovered or elected yet, an election requires at least X nodes


I'm trying to automate horizontal scale-up and scale-down of Elasticsearch nodes in a Kubernetes cluster.

Initially, I deployed an Elasticsearch cluster (3 master, 3 data and 3 ingest nodes) on a Kubernetes cluster, with cluster.initial_master_nodes set to:

cluster.initial_master_nodes:
  - master-a
  - master-b
  - master-c

Then I performed a scale-down operation, reducing the number of master nodes from 3 to 1 (unusual, but for testing purposes). To do this, I deleted the master-b and master-c nodes and restarted the master-a node with the following setting:

cluster.initial_master_nodes:
  - master-a

Since the Elasticsearch nodes (i.e. pods) use persistent volumes, master-a shows the following log after the restart:

"message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [TxdOAdryQ8GAeirXQHQL-g, VmtilfRIT6KDVv1R6MHGlw, KAJclUD2SM6rt9PxCGACSA], have discovered [] which is not a quorum; discovery will continue using [] from hosts providers and [{master-a}{VmtilfRIT6KDVv1R6MHGlw}{g29haPBLRha89dZJmclkrg}{10.244.0.95}{10.244.0.95:9300}{ml.machine_memory=12447109120, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 5, last-accepted version 40 in term 5"  }

Seems like it's trying to find master-b and master-c.

Questions:

  • How can I override the cluster settings so that master-a won't search for these deleted nodes?

Solution

  • The cluster.initial_master_nodes setting only has an effect the first time the cluster starts up. To avoid some very rare corner cases, you should never change its value once you've set it, and generally you should remove it from the config file as soon as possible. From the reference manual regarding cluster.initial_master_nodes:

    You should not use this setting when restarting a cluster or adding a new node to an existing cluster.

    Aside from that, Elasticsearch uses a quorum-based election protocol, and the reference manual says the following:

    To be sure that the cluster remains available you must not stop half or more of the nodes in the voting configuration at the same time.

    You have stopped two of your three master-eligible nodes at the same time, which is more than half of them, so it's expected that the cluster no longer works.
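To make the quorum arithmetic concrete, here is a minimal Python sketch of a majority check. It is a simplified model for illustration only, not Elasticsearch's actual election code; the function name has_quorum is hypothetical. The node IDs are the ones from the log message in the question, where VmtilfRIT6KDVv1R6MHGlw is master-a.

```python
def has_quorum(voting_config, available_ids):
    """Simplified model: a quorum needs strictly more than half of the
    node IDs in the voting configuration to be available."""
    present = set(voting_config) & set(available_ids)
    return 2 * len(present) > len(voting_config)


# Voting configuration taken from the log message in the question.
voting_config = [
    "TxdOAdryQ8GAeirXQHQL-g",
    "VmtilfRIT6KDVv1R6MHGlw",  # master-a
    "KAJclUD2SM6rt9PxCGACSA",
]

# After deleting master-b and master-c, only master-a itself is available.
only_master_a = ["VmtilfRIT6KDVv1R6MHGlw"]

# One vote out of three is not a majority, so no master can be elected.
quorum_after_scale_down = has_quorum(voting_config, only_master_a)

# With any two of the three nodes present, an election could succeed.
quorum_with_two = has_quorum(voting_config, voting_config[:2])
```

This matches the log line: the election "requires at least 2 nodes" from the three listed IDs, and master-a on its own contributes only one.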

    The reference manual also contains instructions for removing master-eligible nodes which you have not followed:

    As long as there are at least three master-eligible nodes in the cluster, as a general rule it is best to remove nodes one-at-a-time, allowing enough time for the cluster to automatically adjust the voting configuration and adapt the fault tolerance level to the new set of nodes.

    If there are only two master-eligible nodes remaining then neither node can be safely removed since both are required to reliably make progress. To remove one of these nodes you must first inform Elasticsearch that it should not be part of the voting configuration, and that the voting power should instead be given to the other node.

    It goes on to describe how to safely remove the unwanted nodes from the voting configuration using POST /_cluster/voting_config_exclusions/node_name when scaling down to a single node.
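As a rough illustration of that last step, the following Python sketch builds (but does not send) the exclusion request for a node, using the path form quoted above. The base URL http://localhost:9200 and the helper name build_exclusion_request are assumptions for the example; authentication, TLS and error handling are left out, and the exact form of the API should be checked against the reference manual for your Elasticsearch version.

```python
import urllib.request
from urllib.parse import quote


def build_exclusion_request(node_name, base_url="http://localhost:9200"):
    """Build a POST request that asks Elasticsearch to exclude node_name
    from the voting configuration. base_url is an assumed address."""
    path = "/_cluster/voting_config_exclusions/" + quote(node_name, safe="")
    return urllib.request.Request(base_url + path, method="POST")


# Build one request per node to be removed. They must be sent while the
# cluster still has an elected master, i.e. before the pods are deleted.
requests_to_send = [build_exclusion_request(name) for name in ("master-c", "master-b")]

# To actually send a request (needs a reachable cluster):
#     with urllib.request.urlopen(requests_to_send[0]) as response:
#         print(response.status)
```

In an automated scale-down, these calls would run before the master-b and master-c pods are deleted, and the nodes would be shut down only after the exclusions have been accepted.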