Search code examples
kuberneteskubeadmcalicocorednscni

Improper cni install preventing coredns pods from starting


Just installed a single master cluster using kubeadm v1.15.0. However, coredns seems stuck in pending mode:

coredns-5c98db65d4-4pm65                      0/1     Pending    0          2m17s   <none>        <none>                <none>           <none>
coredns-5c98db65d4-55hcc                      0/1     Pending    0          2m2s    <none>        <none>                <none>           <none>

the following is what shows up for the pod:

kubectl describe pods coredns-5c98db65d4-4pm65 --namespace=kube-system
Name:                 coredns-5c98db65d4-4pm65
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               k8s-app=kube-dns
                      pod-template-hash=5c98db65d4
Annotations:          <none>
Status:               Pending
IP:
Controlled By:        ReplicaSet/coredns-5c98db65d4
Containers:
  coredns:
    Image:       k8s.gcr.io/coredns:1.3.1
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-5t2wn (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-5t2wn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-5t2wn
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  61s (x4 over 5m21s)  default-scheduler  0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.

I removed the taint on the master node, to no avail. Shouldn't I be able to created a single node master without any problems like this. I know scheduling pods on the master is not possible without removing the taint, but this is odd.

I tried adding the latest calico cni, to no avail, too.

I get the following running journalctl (systemctl shows no errors):

sudo journalctl -xn --unit kubelet.service
[sudo] password for gms:
-- Logs begin at Fri 2019-07-12 04:31:34 CDT, end at Tue 2019-07-16 16:58:17 CDT. --
Jul 16 16:57:54 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:57:54.122355   11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:57:54 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:57:54.400606   11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:57:59 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:57:59.124863   11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:57:59 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:57:59.400924   11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:58:04 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:58:04.127120   11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:58:04 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:58:04.401266   11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:58:09 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:58:09.129287   11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:58:09 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:58:09.401520   11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 16 16:58:14 thalia0.ahc.umn.edu kubelet[11250]: E0716 16:58:14.133059   11250 kubelet.go:2169] Container runtime network not ready: NetworkReady=false reason:NetworkPl
Jul 16 16:58:14 thalia0.ahc.umn.edu kubelet[11250]: W0716 16:58:14.402008   11250 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d

Indeed, when I look in /etc/cni/net.d there is nothing there -> yes, I ran kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml... this is the output when I apply this:

configmap/calico-config created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
serviceaccount/calico-node created
deployment.apps/calico-kube-controllers created
serviceaccount/calico-kube-controllers created

I ran the following on the pod for calico-node, which is stuck in the following state:

calico-node-tcfhw    0/1     Init:0/3   0          11m   10.32.3.158




describe pods calico-node-tcfhw --namespace=kube-system
Name:                 calico-node-tcfhw
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 thalia0.ahc.umn.edu/10.32.3.158
Start Time:           Tue, 16 Jul 2019 18:08:25 -0500
Labels:               controller-revision-hash=844ddd97c6
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod:
Status:               Pending
IP:                   10.32.3.158
Controlled By:        DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  docker://1e1bf9e65cb182656f6f06a1bb8291237562f0f5a375e557a454942e81d32063
    Image:         calico/cni:v3.8.0
    Image ID:      docker-pullable://docker.io/calico/cni@sha256:decba0501ab0658e6e7da2f5625f1eabb8aba5690f9206caba3bf98caca5094c
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Running
      Started:      Tue, 16 Jul 2019 18:08:26 -0500
    Ready:          False
    Restart Count:  0
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
  install-cni:
    Container ID:
    Image:         calico/cni:v3.8.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
  flexvol-driver:
    Container ID:
    Image:          calico/pod2daemon-flexvol:v3.8.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
Containers:
  calico-node:
    Container ID:
    Image:          calico/node:v3.8.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      250m
    Liveness:   http-get http://localhost:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                     kubernetes
      WAIT_FOR_DATASTORE:                 true
      NODENAME:                            (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       k8s,bgp
      IP:                                 autodetect
      CALICO_IPV4POOL_IPIP:               Always
      FELIX_IPINIPMTU:                    <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_IPV4POOL_CIDR:               192.168.0.0/16
      CALICO_DISABLE_FILE_LOGGING:        true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_IPV6SUPPORT:                  false
      FELIX_LOGSEVERITYSCREEN:            info
      FELIX_HEALTHENABLED:                true
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-b9c6p (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  calico-node-token-b9c6p:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-b9c6p
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     :NoSchedule
                 :NoExecute
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type    Reason     Age    From                          Message
  ----    ------     ----   ----                          -------
  Normal  Scheduled  9m15s  default-scheduler             Successfully assigned kube-system/calico-node-tcfhw to thalia0.ahc.umn.edu
  Normal  Pulled     9m14s  kubelet, thalia0.ahc.umn.edu  Container image "calico/cni:v3.8.0" already present on machine
  Normal  Created    9m14s  kubelet, thalia0.ahc.umn.edu  Created container upgrade-ipam
  Normal  Started    9m14s  kubelet, thalia0.ahc.umn.edu  Started container upgrade-ipam

I tried Flannel as a cni, but that was even worse. The kube-proxy wouldn't even start due to a taint!

EDIT ADDENDUM

Should the kube-controller-manager and kube-scheduler not have defined endpoints?

[gms@thalia0 ~]$ kubectl get ep --namespace=kube-system -o wide
NAME                      ENDPOINTS   AGE
kube-controller-manager   <none>      19h
kube-dns                  <none>      19h
kube-scheduler            <none>      19h

[gms@thalia0 ~]$ kubectl get pods --namespace=kube-system
NAME                                          READY   STATUS    RESTARTS   AGE
coredns-5c98db65d4-nmn4g                      0/1     Pending   0          19h
coredns-5c98db65d4-qv8fm                      0/1     Pending   0          19h
etcd-thalia0.x.x.edu.                         1/1     Running   0          19h
kube-apiserver-thalia0.x.x.edu                1/1     Running   0          19h
kube-controller-manager-thalia0.x.x.edu       1/1     Running   0          19h
kube-proxy-4hrdc                              1/1     Running   0          19h
kube-proxy-vb594                              1/1     Running   0          19h
kube-proxy-zwrst                              1/1     Running   0          19h
kube-scheduler-thalia0.x.x.edu                1/1     Running   0          19h

Lastly, for sanity's sake, I tried v1.13.1, and voila! Success:

NAME                                          READY   STATUS    RESTARTS   AGE
calico-node-pbrps                             2/2     Running   0          15s
coredns-86c58d9df4-g5944                      1/1     Running   0          2m40s
coredns-86c58d9df4-zntjl                      1/1     Running   0          2m40s
etcd-thalia0.ahc.umn.edu                      1/1     Running   0          110s
kube-apiserver-thalia0.ahc.umn.edu            1/1     Running   0          105s
kube-controller-manager-thalia0.ahc.umn.edu   1/1     Running   0          103s
kube-proxy-qxh2h                              1/1     Running   0          2m39s
kube-scheduler-thalia0.ahc.umn.edu            1/1     Running   0          117s

EDIT 2

Tried sudo kubeadm upgrade plan and got an error on api-server's health and bad certs.

Ran this on the api-server:

kubectl logs kube-apiserver-thalia0.x.x.edu --namespace=kube-system1

and got a ton of errors of the sort TLS handshake error from 10.x.x.157:52384: remote error: tls: bad certificate, which were from nodes that have long been deleted from the cluster and, long after several kubeadm resets on the master, along with uninstall/reinstall of kubelet, kubeadm, etc.

Why are these old nodes showing up? Don't the certs get recreated on a kubeadm init?


Solution

  • This issue https://github.com/projectcalico/calico/issues/2699 had similar symptoms and indicates that deleting /var/lib/cni/ fixed the issue. You could see if it exists and delete it if so.