kubernetes etcd etcdctl

ETCD Cluster getting rpc error: code = DeadlineExceeded desc = context deadline exceeded


Just looking for some clarification here. I have a 2-node etcd cluster:

master01=http://10.1.1.21:2379,master02=http://10.1.1.22:2379

All running fine. If I log in to master01 and run the following:

etcdctl --cluster=true endpoint health

I get a good response:

http://10.1.1.21:2379 is healthy: successfully committed proposal: took = 25.628392ms
http://10.1.1.22:2379 is healthy: successfully committed proposal: took = 42.98645ms

All operations (get, put) run as expected:

ETCDCTL_API=3 etcdctl --endpoints=http://10.1.1.21:2379,http://10.1.1.22:2379 get date

The trouble starts when I drop one of the nodes. If I kill one node, I get errors instead of results, for example:

ETCDCTL_API=3 etcdctl --endpoints=http://10.1.1.21:2379,http://10.1.1.22:2379 get date
{"level":"warn","ts":"2021-09-09T08:58:22.175Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000e0a80/#initially=[http://10.1.1.21:2379;http://10.1.1.22:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded

In this case I killed off master01. Am I doing something wrong?


Solution

  • An etcd cluster needs a majority of nodes, a quorum, to agree on updates to the cluster state. For a cluster with n members, quorum is (n/2)+1, with n/2 rounded down. For any odd-sized cluster, adding one node will always increase the number of nodes necessary for quorum. Although adding a node to an odd-sized cluster appears better since there are more machines, the fault tolerance is worse: exactly the same number of nodes may fail without losing quorum, but there are more nodes that can fail. If the cluster is in a state where it can’t tolerate any more failures, adding a node before removing nodes is dangerous, because if the new node fails to register with the cluster (e.g., the address is misconfigured), quorum will be permanently lost.

    So in your case, two etcd nodes give you the same fault tolerance as one: quorum for a 2-member cluster is 2, so losing either node loses quorum. That is why an odd number of etcd nodes is always recommended. code = DeadlineExceeded desc = context deadline exceeded means the request did not complete before the client's timeout. That happens when the client tries an etcd server that is down, and it also happens when the surviving member cannot reach quorum, because by default reads are linearizable and, like writes, need a majority of members. Refer to the doc below to learn more:

    ETCD FAULT TOLERANCE TABLE
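
To make the quorum arithmetic above concrete, here is a minimal Python sketch. It is not taken from the etcd docs or code; the function names quorum, fault_tolerance and tolerance_table are made up for illustration, and it only assumes the (n/2)+1 majority rule quoted in the answer, with n/2 rounded down.

```python
def quorum(n):
    """Smallest majority of an n-member cluster: floor(n/2) + 1."""
    if n < 1:
        raise ValueError("cluster size must be at least 1")
    return n // 2 + 1


def fault_tolerance(n):
    """Number of members that can fail while a majority is still up."""
    return n - quorum(n)


def tolerance_table(max_size):
    """Rows of (cluster size, quorum, failures tolerated) for sizes 1..max_size."""
    return [(n, quorum(n), fault_tolerance(n)) for n in range(1, max_size + 1)]


if __name__ == "__main__":
    # Print a small table; the 2-member row is the case from the question.
    print(f"{'size':>4} {'quorum':>6} {'tolerance':>9}")
    for size, majority, failures in tolerance_table(5):
        print(f"{size:>4} {majority:>6} {failures:>9}")
```

For the 2-member cluster in the question this gives a quorum of 2 and a tolerance of 0, the same tolerance as a single node, while 3 members tolerate one failure.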