
Running Kubernetes in multimaster mode


I have set up a Kubernetes (version 1.6.1) cluster with three servers in the control plane. The apiserver is running with the following config:

/usr/bin/kube-apiserver \
  --admission-control=NamespaceLifecycle,LimitRanger,SecurityContextDeny,ServiceAccount,ResourceQuota \
  --advertise-address=x.x.x.x \
  --allow-privileged=true \
  --audit-log-path=/var/lib/k8saudit.log \
  --authorization-mode=ABAC \
  --authorization-policy-file=/var/lib/kubernetes/authorization-policy.jsonl \
  --bind-address=0.0.0.0 \
  --etcd-servers=https://kube1:2379,https://kube2:2379,https://kube3:2379 \
  --etcd-cafile=/etc/etcd/ca.pem \
  --event-ttl=1h \
  --insecure-bind-address=0.0.0.0 \
  --kubelet-certificate-authority=/var/lib/kubernetes/ca.pem \
  --kubelet-client-certificate=/var/lib/kubernetes/kubernetes.pem \
  --kubelet-client-key=/var/lib/kubernetes/kubernetes-key.pem \
  --kubelet-https=true \
  --service-account-key-file=/var/lib/kubernetes/ca-key.pem \
  --service-cluster-ip-range=10.32.0.0/24 \
  --service-node-port-range=30000-32767 \
  --tls-cert-file=/var/lib/kubernetes/kubernetes.pem \
  --tls-private-key-file=/var/lib/kubernetes/kubernetes-key.pem \
  --token-auth-file=/var/lib/kubernetes/token.csv \
  --v=2 \
  --apiserver-count=3 \
  --storage-backend=etcd2

Now I am running kubelet with the following config:

/usr/bin/kubelet \
  --api-servers=https://kube1:6443,https://kube2:6443,https://kube3:6443 \
  --allow-privileged=true \
  --cluster-dns=10.32.0.10 \
  --cluster-domain=cluster.local \
  --container-runtime=docker \
  --network-plugin=kubenet \
  --kubeconfig=/var/lib/kubelet/kubeconfig \
  --serialize-image-pulls=false \
  --register-node=true \
  --cert-dir=/var/lib/kubelet \
  --tls-cert-file=/var/lib/kubernetes/kubelet.pem \
  --tls-private-key-file=/var/lib/kubernetes/kubelet-key.pem \
  --hostname-override=node1 \
  --v=2

This works great as long as kube1 is running. If I take kube1 down, the node does not communicate with kube2 or kube3. It always uses the first apiserver passed to the --api-servers flag and does not fail over if that apiserver crashes. What is the correct way to fail over when one of the apiservers goes down?


Solution

The --api-servers flag is deprecated and no longer appears in the documentation. A kubeconfig file is now the way to point kubelet at kube-apiserver.
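
For reference, a minimal kubelet kubeconfig along these lines might look like the sketch below. The server URL points at a single stable local endpoint (for example a local proxy in front of the three masters, see the next approach), and the certificate paths are assumptions based on the flags in the question; adjust them to your setup.

apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority: /var/lib/kubernetes/ca.pem
    # one stable endpoint instead of three apiserver addresses
    server: https://127.0.0.1:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubelet
  name: kubelet
current-context: kubelet
users:
- name: kubelet
  user:
    client-certificate: /var/lib/kubernetes/kubelet.pem
    client-key: /var/lib/kubernetes/kubelet-key.pem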

The preferred way to do this today is to deploy a Pod with nginx on each worker node (i.e. the ones running kubelet) that load-balances across the three kube-apiservers. nginx will notice when a master goes down and stop routing traffic to it; that's its job. The kubespray project uses this method.
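
As a rough sketch, such a proxy can be a plain TCP load balancer using nginx's stream module (placed at the top level of nginx.conf). The kube1/kube2/kube3 names come from the question; the local listen address and the timeout are assumptions:

stream {
  upstream kube_apiservers {
    server kube1:6443;
    server kube2:6443;
    server kube3:6443;
  }
  server {
    # kubelet's kubeconfig points at this local endpoint
    listen 127.0.0.1:6443;
    proxy_pass kube_apiservers;
    proxy_connect_timeout 2s;
  }
}

Because this proxies raw TCP, TLS is still terminated by the apiservers themselves, so kubelet keeps seeing their certificates.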

The 2nd, not-so-good, way is to use DNS round-robin (RR). Create a DNS "A" record containing the IPs of the three masters and point kubelet at this hostname instead of the three individual IPs. Each time kubelet contacts a master, it is routed to one of the IPs in the RR list. This technique isn't robust because traffic will still be routed to the downed node, so the cluster will experience intermittent outages.
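
For illustration, the RR record is simply several A records for the same name; the hostname and IPs below are hypothetical. A short TTL limits, but does not eliminate, how long a dead master keeps receiving traffic:

kube-apiserver.example.com.  60  IN  A  192.168.0.11
kube-apiserver.example.com.  60  IN  A  192.168.0.12
kube-apiserver.example.com.  60  IN  A  192.168.0.13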

The 3rd, and in my opinion more complex, method is to use keepalived. keepalived uses VRRP to ensure that at least one node owns the Virtual IP (VIP). If a master goes down, another master takes over the VIP to ensure continuity. The downside of this method is that load-balancing doesn't come by default: all traffic is routed to one master (i.e. the primary VRRP node) until it goes down, and only then does the secondary VRRP node take over. You can see the write-up I contributed at this page :)
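
A minimal keepalived.conf sketch for the primary master is shown below; the interface, virtual_router_id, password, and VIP are placeholders, and the other masters would carry the same block with state BACKUP and a lower priority:

vrrp_instance kube_apiserver_vip {
  state MASTER
  interface eth0
  virtual_router_id 51
  priority 100
  advert_int 1
  authentication {
    auth_type PASS
    auth_pass changeme
  }
  virtual_ipaddress {
    # the VIP that kubelet's kubeconfig points at, e.g. https://192.168.0.100:6443
    192.168.0.100/24
  }
}

The apiserver certificates would also need to include the VIP in their subject alternative names for kubelet to trust the endpoint.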

More details about kube-apiserver HA here. Good luck!