Tags: kubernetes, rancher-rke

Rancher: Kubernetes cluster stuck in pending. "No route to host"


I first built a Kubernetes cluster on CentOS 8, following the how-to found here: https://www.tecmint.com/install-a-kubernetes-cluster-on-centos-8/

I then built an Ubuntu 18.04 VM and installed Rancher on it. I can access the Rancher website just fine and everything appears to be working on the Rancher side, except that I can't add my Kubernetes cluster to it.

When I use the "Add Cluster" feature, I choose the "Other Cluster" option, give it a name, and then click Create. I then copy the insecure "Cluster Registration Command" to the master node, which appears to accept the command just fine.
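For reference, the insecure "Cluster Registration Command" that Rancher generates looks roughly like the following (the token portion is a placeholder, not the real value from my setup):

curl --insecure -sfL https://192.168.188.189:8443/v3/import/<token>.yaml | kubectl apply -f -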

In troubleshooting, I've issued the following command: kubectl -n cattle-system logs -l app=cattle-cluster-agent

The output I get is as follows:

INFO: Environment: CATTLE_ADDRESS=10.42.0.1 CATTLE_CA_CHECKSUM=94ad10e756d390cdf8b25465f938c04344a396b16b4ff6c0922b9cd6b9fc454c CATTLE_CLUSTER=true CATTLE_CLUSTER_REGISTRY= CATTLE_FEATURES= CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-7b9df685cf-9kr4p CATTLE_SERVER=https://192.168.188.189:8443
INFO: Using resolv.conf: nameserver 10.96.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local options ndots:5
ERROR: https://192.168.188.189:8443/ping is not accessible (Failed to connect to 192.168.188.189 port 8443: No route to host)
INFO: Environment: CATTLE_ADDRESS=10.40.0.0 CATTLE_CA_CHECKSUM=94ad10e756d390cdf8b25465f938c04344a396b16b4ff6c0922b9cd6b9fc454c CATTLE_CLUSTER=true CATTLE_CLUSTER_REGISTRY= CATTLE_FEATURES= CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-7bc7687557-tkvzt CATTLE_SERVER=https://192.168.188.189:8443
INFO: Using resolv.conf: nameserver 10.96.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local options ndots:5
ERROR: https://192.168.188.189:8443/ping is not accessible (Failed to connect to 192.168.188.189 port 8443: No route to host)
[root@k8s-master ~]# ping 192.168.188.189
PING 192.168.188.189 (192.168.188.189) 56(84) bytes of data.
64 bytes from 192.168.188.189: icmp_seq=1 ttl=64 time=0.432 ms
64 bytes from 192.168.188.189: icmp_seq=2 ttl=64 time=0.400 ms
^C
--- 192.168.188.189 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.400/0.416/0.432/0.016 ms

As you can see, I'm getting a "No route to host" error message, yet I can ping the Rancher VM using its IP address.

It appears to be using the resolv.conf inside the cluster, with 10.96.0.10 as the nameserver, to resolve the IP address 192.168.188.189 (my Rancher VM), but it appears to be failing to do so.
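To double-check this from inside the cluster, a couple of throwaway debug pods can reproduce the failure and confirm whether the cluster DNS is answering at all (the images below are just commonly used debug images, not part of my setup):

# Try to reach the Rancher server from inside a pod
kubectl run curl-test -it --rm --restart=Never --image=curlimages/curl -- curl -vk https://192.168.188.189:8443/ping

# Check that the cluster DNS at 10.96.0.10 resolves in-cluster names
kubectl run dns-test -it --rm --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default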

I'm thinking I have some sort of DNS issue that's preventing me from using hostnames, though I've edited the /etc/hosts file on the master and worker nodes to include entries for each of the devices. I can ping devices using their hostnames, but I can't reach a pod using <hostname>:<NodePort>; I get a "No route to host" error message when I try that too. See here:

[root@k8s-master ~]# ping k8s-worker1
PING k8s-worker1 (192.168.188.191) 56(84) bytes of data.
64 bytes from k8s-worker1 (192.168.188.191): icmp_seq=1 ttl=64 time=0.478 ms
64 bytes from k8s-worker1 (192.168.188.191): icmp_seq=2 ttl=64 time=0.449 ms
^C
--- k8s-worker1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.449/0.463/0.478/0.025 ms
[root@k8s-master ~]# kubectl get svc
NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
hello-world   NodePort    10.103.5.49     <none>        8080:30370/TCP   45m
kubernetes    ClusterIP   10.96.0.1       <none>        443/TCP          26h
nginx         NodePort    10.97.172.245   <none>        80:30205/TCP     3h43m
[root@k8s-master ~]# kubectl get pods -o wide
NAME                           READY   STATUS    RESTARTS   AGE     IP          NODE          NOMINATED NODE   READINESS GATES
hello-world-7884c6997d-2dc9z   1/1     Running   0          28m     10.40.0.4   k8s-worker3   <none>           <none>
hello-world-7884c6997d-562lh   1/1     Running   0          28m     10.35.0.8   k8s-worker2   <none>           <none>
hello-world-7884c6997d-78dmm   1/1     Running   0          28m     10.36.0.3   k8s-worker1   <none>           <none>
hello-world-7884c6997d-7vt4f   1/1     Running   0          28m     10.40.0.6   k8s-worker3   <none>           <none>
hello-world-7884c6997d-bpq5g   1/1     Running   0          49m     10.36.0.2   k8s-worker1   <none>           <none>
hello-world-7884c6997d-c529d   1/1     Running   0          28m     10.35.0.6   k8s-worker2   <none>           <none>
hello-world-7884c6997d-ddk7k   1/1     Running   0          28m     10.36.0.5   k8s-worker1   <none>           <none>
hello-world-7884c6997d-fq8hx   1/1     Running   0          28m     10.35.0.7   k8s-worker2   <none>           <none>
hello-world-7884c6997d-g5lxs   1/1     Running   0          28m     10.40.0.3   k8s-worker3   <none>           <none>
hello-world-7884c6997d-kjb7f   1/1     Running   0          49m     10.35.0.3   k8s-worker2   <none>           <none>
hello-world-7884c6997d-nfdpc   1/1     Running   0          28m     10.40.0.5   k8s-worker3   <none>           <none>
hello-world-7884c6997d-nnd6q   1/1     Running   0          28m     10.36.0.7   k8s-worker1   <none>           <none>
hello-world-7884c6997d-p6gxh   1/1     Running   0          49m     10.40.0.1   k8s-worker3   <none>           <none>
hello-world-7884c6997d-p7v4b   1/1     Running   0          28m     10.35.0.4   k8s-worker2   <none>           <none>
hello-world-7884c6997d-pwpxr   1/1     Running   0          28m     10.36.0.4   k8s-worker1   <none>           <none>
hello-world-7884c6997d-qlg9h   1/1     Running   0          28m     10.40.0.2   k8s-worker3   <none>           <none>
hello-world-7884c6997d-s89c5   1/1     Running   0          28m     10.35.0.5   k8s-worker2   <none>           <none>
hello-world-7884c6997d-vd8ch   1/1     Running   0          28m     10.40.0.7   k8s-worker3   <none>           <none>
hello-world-7884c6997d-wvnh7   1/1     Running   0          28m     10.36.0.6   k8s-worker1   <none>           <none>
hello-world-7884c6997d-z57kx   1/1     Running   0          49m     10.36.0.1   k8s-worker1   <none>           <none>
nginx-6799fc88d8-gm5ls         1/1     Running   0          4h11m   10.35.0.1   k8s-worker2   <none>           <none>
nginx-6799fc88d8-k2jtw         1/1     Running   0          4h11m   10.44.0.1   k8s-worker1   <none>           <none>
nginx-6799fc88d8-mc5mz         1/1     Running   0          4h12m   10.36.0.0   k8s-worker1   <none>           <none>
nginx-6799fc88d8-qn6mh         1/1     Running   0          4h11m   10.35.0.2   k8s-worker2   <none>           <none>
[root@k8s-master ~]# curl k8s-worker1:30205
curl: (7) Failed to connect to k8s-worker1 port 30205: No route to host

I suspect this is the underlying reason why I can't join the cluster to Rancher.

EDIT: I want to add some additional details to this question. Each of my nodes (master & workers) has the following ports open on the firewall:

firewall-cmd --list-ports
6443/tcp 2379-2380/tcp 10250/tcp 10251/tcp 10252/tcp 10255/tcp 6783/tcp 6783/udp 6784/udp

For the CNI, the Kubernetes cluster is using Weave Net.
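As far as I know, Weave Net itself needs TCP 6783 and UDP 6783-6784 open between the nodes (which are in the firewall list above). Its peer connections can be sanity-checked with something like this (the pod name is a placeholder for one of the weave-net pods):

kubectl get pods -n kube-system -l name=weave-net -o wide
kubectl exec -n kube-system weave-net-xxxxx -c weave -- /home/weave/weave --local status connections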

Each node (master & worker) is configured to use my main home DNS server (which is also an Active Directory domain controller) in its networking configuration. I've created A records for each node on the DNS server. The nodes are NOT joined to the domain. However, I've also edited each node's /etc/hosts file to contain the following records:

# more /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.188.190 k8s-master
192.168.188.191 k8s-worker1
192.168.188.192 k8s-worker2
192.168.188.193 k8s-worker3

I've found that I CAN use "curl k8s-worker1.mydomain.com:30370" with about 33% success. But I would have thought that the /etc/hosts file would take precedence over using my home DNS server.
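As far as I understand, whether /etc/hosts wins over the DNS server is controlled by the hosts line in /etc/nsswitch.conf; on CentOS 8 the default lists files before dns, which can be confirmed with:

# grep '^hosts:' /etc/nsswitch.conf
hosts:      files dns myhostname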

And finally, I noticed an additional anomaly: the cluster is not load balancing across the three worker nodes. As shown above, I'm running a deployment called "hello-world" based on the bashofmann/rancher-demo image with 20 replicas. I've also created a NodePort service for hello-world that maps node port 30370 to port 8080 on each respective pod.
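For reference, the NodePort service was created with something like the following (the app=hello-world selector label is an assumption and has to match the labels on the deployment's pods):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: hello-world
spec:
  type: NodePort
  selector:
    app: hello-world   # assumed label; must match the pods created by the deployment
  ports:
    - port: 8080
      targetPort: 8080
      nodePort: 30370
EOF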

If I open my web browser and go to http://192.168.188.191:30370, the page loads, but it is only ever served by pods on k8s-worker1; it is never served by pods on any of the other worker nodes. This would explain why I only get ~33% success: the request only works when it happens to be served by the worker node I've specified in the URL.


Solution

  • OP confirmed that the issue was due to firewall rules. This was debugged by disabling the firewall, which allowed the desired operation (adding the cluster to Rancher) to succeed.

    In order for NodePort services to work properly, the port range 30000-32767 must be reachable on all nodes of the cluster.
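    A minimal firewalld sketch of what that might look like (instead of leaving the firewall disabled), run on every node:

    firewall-cmd --permanent --add-port=30000-32767/tcp
    firewall-cmd --reload
    firewall-cmd --list-ports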