Tags: microservices, consul, consul-template, consul-health-check, hashicorp

Consul Data Center: Leader node not automatically selected after failure of previous leader node


I'm new to Consul and have created a single Data Center with 2 server nodes. I followed the steps in this documentation: https://learn.hashicorp.com/tutorials/consul/deployment-guide?in=consul/datacenter-deploy

The nodes are created successfully, and they both stay in sync when I launch a service. Everything works fine up to this step.

However, I face an issue when the leader node fails (goes offline). In that case, the follower node DOES NOT automatically assume the role of leader, and Consul as a whole becomes inaccessible to the service. The follower node stops responding to requests even though it is still running.

Can anyone help me understand what exactly is wrong with my setup and how I can keep it working, with the follower node automatically becoming the leader and responding to queries from the API Gateway?

The documentation below gives some pointers and talks about fulfilling a 'quorum' for automatic selection of a leader. I'm not sure whether it applies in my case:

https://learn.hashicorp.com/tutorials/consul/recovery-outage-primary?in=consul/datacenter-operations#outage-event-in-the-primary-datacenter

Edit:

consul.hcl

First Server:

datacenter = "dc1"
data_dir = "D:/Hashicorp/Consul/data"
encrypt = "<key>"
ca_file = "D:/Hashicorp/Consul/certs/consul-agent-ca.pem"
cert_file = "D:/Hashicorp/Consul/certs/dc1-server-consul-0.pem"
key_file = "D:/Hashicorp/Consul/certs/dc1-server-consul-0-key.pem"
verify_incoming = true
verify_outgoing = true
verify_server_hostname = true
retry_join = ["<ip1>", "<ip2>"]

Second Server:

datacenter = "dc1"
data_dir = "D:/Hashicorp/Consul/data"
encrypt = "<key>"
ca_file = "D:/Hashicorp/Consul/certs/consul-agent-ca.pem"
cert_file = "D:/Hashicorp/Consul/certs/dc1-server-consul-1.pem"
key_file = "D:/Hashicorp/Consul/certs/dc1-server-consul-1-key.pem"
verify_incoming = true
verify_outgoing = true
verify_server_hostname = true
retry_join = ["<ip1>", "<ip2>"]

server.hcl:

First Server:

server = true
bootstrap_expect = 2
client_addr = "<ip1>"
ui = true

Second Server:

server = true
bootstrap_expect = 2
client_addr = "<ip2>"
ui = true
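
For reference, this is how I inspect the Raft peer set while both servers are up; both servers are listed as voters:

consul operator raft list-peers

Once the leader goes offline, the same command on the surviving server only returns an error about there being no cluster leader.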

Solution

  • The size of the cluster and its ability to form a quorum are absolutely applicable in this case. With only two servers the quorum size is also two, so losing either server leaves the remaining node unable to elect a leader. You will need a minimum of 3 nodes in the cluster in order to tolerate the failure of one node without sacrificing the availability of the cluster.

    I recommend reading Consul's Raft Protocol Overview as well as reviewing the deployment table at the bottom of that page to help understand the failure tolerance provided by various cluster sizes.
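
    As a rough sketch of what that change looks like in your setup (assuming a third server is added; <ip3> and <ip-of-this-server> are placeholders), server.hcl on every server would expect three servers and consul.hcl would retry-join all three:

    server.hcl (same on all three servers, apart from client_addr):

    server = true
    bootstrap_expect = 3   # quorum = floor(3/2) + 1 = 2, so one server can fail
    client_addr = "<ip-of-this-server>"
    ui = true

    consul.hcl (retry_join lists all three servers):

    retry_join = ["<ip1>", "<ip2>", "<ip3>"]

    Note that bootstrap_expect only affects the initial bootstrap of the cluster; the point is simply that three voters can lose one and still reach the quorum of two.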