I am trying to configure a K8s cluster on-prem and the servers are running Fedora CoreOS using multiple NICs.
I am configuring the cluster to use a non-default NIC: a bond defined over 2 interfaces. All servers can reach each other over that interface and have HTTP + HTTPS connectivity to the internet.
kubeadm join hangs at:
I0513 13:24:55.516837 16428 token.go:215] [discovery] Failed to request cluster-info, will try again: Get https://${BOND_IP}:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The relevant kubeadm init config looks like this:
[...]
localAPIEndpoint:
  advertiseAddress: ${BOND_IP}
  bindPort: 6443
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
    runtime-cgroups: "/systemd/system.slice"
    kubelet-cgroups: "/systemd/system.slice"
    node-ip: ${BOND_IP}
  criSocket: /var/run/dockershim.sock
  name: master
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
[...]
The join config that I am using looks like this:
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    token: ${TOKEN}
    caCertHashes:
    - "${SHA}"
    apiServerEndpoint: "${BOND_IP}:6443"
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
    runtime-cgroups: "/systemd/system.slice"
    kubelet-cgroups: "/systemd/system.slice"
If I configure the cluster using the default eth0 instead, it works without issues.
This is not a connectivity issue. The port test works fine:
# nc -s ${BOND_IP_OF_NODE} -zv ${BOND_IP_OF_MASTER} 6443
Ncat: Version 7.80 ( https://nmap.org/ncat )
Ncat: Connected to ${BOND_IP_OF_MASTER}:6443.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
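A successful TCP connect does not prove that a full HTTPS request completes, and the kubeadm log above fails while awaiting response headers. As a rough sketch, the same request can be reproduced with curl from a chosen source address. The function names here are hypothetical helpers, not part of kubeadm; the URL is copied from the log line, and `-k` skips certificate verification, so this is for diagnosis only.

```shell
#!/usr/bin/env bash
# Build the URL that the kubeadm discovery step requests (taken from the log line).
# Port defaults to 6443, the bindPort used in the init config.
cluster_info_url() {
  local master_ip="$1" port="${2:-6443}"
  printf 'https://%s:%s/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s\n' \
    "$master_ip" "$port"
}

# Request that URL from a specific source address.
#   -k            skip TLS verification (diagnostic use only)
#   --interface   pin the source IP address or NIC name
#   --max-time    give up after 10 seconds, like the kubeadm client timeout
probe_cluster_info() {
  local node_ip="$1" master_ip="$2"
  curl -k -sS --interface "$node_ip" --max-time 10 "$(cluster_info_url "$master_ip")"
}

# Example (not executed here):
#   probe_cluster_info "${BOND_IP_OF_NODE}" "${BOND_IP_OF_MASTER}"
```

If this call also times out while the nc test succeeds, the problem is likely below Kubernetes, in how packets are carried after the initial handshake.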
I suspect this happens because the kubelet is listening on eth0. If so, can I change it to use a different NIC/IP?
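One way to check which NIC and source address the kernel actually picks for traffic to the master is `ip route get`. The sketch below assumes the one-line output of `ip -o route get`; `route_dev_src` is a hypothetical helper that pulls out the `dev` and `src` fields.

```shell
#!/usr/bin/env bash
# Read `ip -o route get <addr>` output on stdin and print the outgoing
# device and source address as "dev=<iface> src=<ip>".
# Only the first line is examined.
route_dev_src() {
  awk 'NR == 1 {
    dev = ""; src = ""
    for (i = 1; i < NF; i++) {
      if ($i == "dev") dev = $(i + 1)
      if ($i == "src") src = $(i + 1)
    }
    print "dev=" dev " src=" src
  }'
}

# Example (not executed here):
#   ip -o route get "${BOND_IP_OF_MASTER}" | route_dev_src
```

If the reported device is not the bond, or the source address is not the node's bond IP, outgoing traffic is leaving through a different path than intended.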
LE: The eth0 connection has been cut off completely (cable out, interface down, connection down). Now, when we init with 0.0.0.0 as the advertise address for the kube-apiserver, it defaults to the bond, which is what we wanted initially:
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 0.0.0.0
Result:
[certs] apiserver serving cert is signed for DNS names [emp-prod-nl-hilv-quortex19 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.0.0.1 ${BOND_IP}]
I have even added an ACCEPT rule for port 6443 in iptables and it still times out. All my Calico pods are up and running (as are all other pods in the kube-system namespace).
LLE:
I have tested Calico and Weave Net and both show the same issue. The api-server is up and can be reached from the master using curl, but requests time out from the nodes.
LLLE:
On the premise that the kube-api is nothing but an HTTPS server, I have tried two options from the node that cannot reach it when doing the kubeadm join:
The node just can't reach the api-server on 6443, or on any other port for that matter.
What am I doing wrong?
The cause:
The interface used was a bond in ACTIVE-ACTIVE mode. As a result, kubeadm apparently tried the other of the 2 bonded interfaces, which was not in the same subnet as the advertised IP of the server.
Switching the bond to ACTIVE-PASSIVE did the trick, and I was able to join the nodes.
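To confirm which mode a bond is running in, the Linux bonding driver exposes a status file under /proc/net/bonding/. The sketch below assumes the bond is named bond0 (the name is not given above) and uses two hypothetical helpers that read the `Bonding Mode:` and `Slave Interface:` lines from that file.

```shell
#!/usr/bin/env bash
# Print the bonding mode recorded in a /proc/net/bonding/<bond> style file.
bond_mode() {
  sed -n 's/^Bonding Mode: //p' "$1"
}

# Print the member interfaces listed in the same file, one per line.
bond_slaves() {
  sed -n 's/^Slave Interface: //p' "$1"
}

# Example (not executed here; bond0 is an assumed name):
#   bond_mode /proc/net/bonding/bond0
#   bond_slaves /proc/net/bonding/bond0
```

An active-backup bond is reported by the driver as `fault-tolerance (active-backup)`, and an LACP bond as `IEEE 802.3ad Dynamic link aggregation`.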
LE: If anyone knows why kubeadm join does not work with LACP in ACTIVE-ACTIVE bond setups on Fedora CoreOS, please advise here. Otherwise, if additional configuration is required, I would very much like to know what I have missed.