Search code examples
elasticsearchreplicationrecoveryfault-tolerance

How does Elasticsearch recover from a quorum that is not unanimous


When using replication with a quorum, Elasticsearch allows writes to fail for some (a small number of) replica shards. Writing to a replica might fail only because it is temporarily unavailable (because of a temporary network partition, for example). When that shard becomes available again (the network is fixed, for example), what happens?

Does Elasticsearch automatically detect that the shard is out of date (stale, inconsistent with the primary shard) and update it in the background? Or must you perform a manual operation? When the shard returns from being unavailable, but is out of date, does Elasticsearch automatically refrain from querying that shard (and retrieving stale data) until it is brought up to date? Or must you provide special query parameters it ensure that out-of-date shards are not used?


Solution

  • Elasticsearch manages automatically the replica that it are out of date. No manual operation or special query are necessary.

    In case of nodes/network failure you have to ensure that a quorum of the cluster remain online, otherwise you will encounter the split brain problem in which you cannot known which of the replica is in line and which is out of date.