Tags: kubernetes, kubectl, kube-dns

Newly provisioned Kubernetes nodes are inaccessible via kubectl


I am using Kubespray with Kubernetes 1.9.

What I'm seeing is the following when I try to interact with pods on my new nodes in any way through kubectl. It is important to note that the nodes are considered healthy and are having pods scheduled on them appropriately. The pods are fully functional.

➜  Scripts k logs -f -n prometheus prometheus-prometheus-node-exporter-gckzj
Error from server: Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on 10.0.0.3:53: no such host

I am able to ping the kubeworker nodes by both IP and DNS name, both locally from where I am running kubectl and from all masters.

➜  Scripts ping kubeworker-rwva1-prod-14
PING kubeworker-rwva1-prod-14 (10.0.0.111): 56 data bytes
64 bytes from 10.0.0.111: icmp_seq=0 ttl=63 time=88.972 ms
^C

ubuntu@kubemaster-rwva1-prod-1:~$ ping kubeworker-rwva1-prod-14
PING kubeworker-rwva1-prod-14 (10.0.0.111) 56(84) bytes of data.
64 bytes from kubeworker-rwva1-prod-14 (10.0.0.111): icmp_seq=1 ttl=64 time=0.259 ms
64 bytes from kubeworker-rwva1-prod-14 (10.0.0.111): icmp_seq=2 ttl=64 time=0.213 ms


➜  Scripts k get nodes
NAME                       STATUS    ROLES     AGE       VERSION
kubemaster-rwva1-prod-1    Ready     master    174d      v1.9.2+coreos.0
kubemaster-rwva1-prod-2    Ready     master    174d      v1.9.2+coreos.0
kubemaster-rwva1-prod-3    Ready     master    174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-1    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-10   Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-11   Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-12   Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-14   Ready     node      16d       v1.9.2+coreos.0
kubeworker-rwva1-prod-15   Ready     node      14d       v1.9.2+coreos.0
kubeworker-rwva1-prod-16   Ready     node      6d        v1.9.2+coreos.0
kubeworker-rwva1-prod-17   Ready     node      4d        v1.9.2+coreos.0
kubeworker-rwva1-prod-18   Ready     node      4d        v1.9.2+coreos.0
kubeworker-rwva1-prod-19   Ready     node      6d        v1.9.2+coreos.0
kubeworker-rwva1-prod-2    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-20   Ready     node      6d        v1.9.2+coreos.0
kubeworker-rwva1-prod-21   Ready     node      6d        v1.9.2+coreos.0
kubeworker-rwva1-prod-3    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-4    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-5    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-6    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-7    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-8    Ready     node      174d      v1.9.2+coreos.0
kubeworker-rwva1-prod-9    Ready     node      174d      v1.9.2+coreos.0

When I describe a broken node, it looks identical to one of my functioning ones.

➜  Scripts k describe node kubeworker-rwva1-prod-14
Name:               kubeworker-rwva1-prod-14
Roles:              node
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=kubeworker-rwva1-prod-14
                    node-role.kubernetes.io/node=true
                    role=app-tier
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Tue, 17 Jul 2018 19:35:08 -0700
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Fri, 03 Aug 2018 18:44:59 -0700   Tue, 17 Jul 2018 19:35:08 -0700   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Fri, 03 Aug 2018 18:44:59 -0700   Tue, 17 Jul 2018 19:35:08 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 03 Aug 2018 18:44:59 -0700   Tue, 17 Jul 2018 19:35:08 -0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready            True    Fri, 03 Aug 2018 18:44:59 -0700   Tue, 17 Jul 2018 19:35:18 -0700   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.0.0.111
  Hostname:    kubeworker-rwva1-prod-14
Capacity:
 cpu:     32
 memory:  147701524Ki
 pods:    110
Allocatable:
 cpu:     31900m
 memory:  147349124Ki
 pods:    110
System Info:
 Machine ID:                 da30025a3f8fd6c3f4de778c5b4cf558
 System UUID:                5ACCBB64-2533-E611-97F0-0894EF1D343B
 Boot ID:                    6b42ba3e-36c4-4520-97e6-e7c6fed195e2
 Kernel Version:             4.4.0-130-generic
 OS Image:                   Ubuntu 16.04.4 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://17.3.1
 Kubelet Version:            v1.9.2+coreos.0
 Kube-Proxy Version:         v1.9.2+coreos.0
ExternalID:                  kubeworker-rwva1-prod-14
Non-terminated Pods:         (5 in total)
  Namespace                  Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                         ------------  ----------  ---------------  -------------
  kube-system                calico-node-cd7qg                            150m (0%)     300m (0%)   64M (0%)         500M (0%)
  kube-system                kube-proxy-kubeworker-rwva1-prod-14          150m (0%)     500m (1%)   64M (0%)         2G (1%)
  kube-system                nginx-proxy-kubeworker-rwva1-prod-14         25m (0%)      300m (0%)   32M (0%)         512M (0%)
  prometheus                 prometheus-prometheus-node-exporter-gckzj    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  rabbit-relay               rabbit-relay-844d6865c7-q6fr2                0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  325m (1%)     1100m (3%)  160M (0%)        3012M (1%)
Events:         <none>

➜  Scripts k describe node kubeworker-rwva1-prod-11
Name:               kubeworker-rwva1-prod-11
Roles:              node
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=kubeworker-rwva1-prod-11
                    node-role.kubernetes.io/node=true
                    role=test
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Fri, 09 Feb 2018 21:03:46 -0800
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Fri, 03 Aug 2018 18:46:31 -0700   Fri, 09 Feb 2018 21:03:38 -0800   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Fri, 03 Aug 2018 18:46:31 -0700   Mon, 16 Jul 2018 13:24:58 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 03 Aug 2018 18:46:31 -0700   Mon, 16 Jul 2018 13:24:58 -0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready            True    Fri, 03 Aug 2018 18:46:31 -0700   Mon, 16 Jul 2018 13:24:58 -0700   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.0.0.218
  Hostname:    kubeworker-rwva1-prod-11
Capacity:
 cpu:     32
 memory:  131985484Ki
 pods:    110
Allocatable:
 cpu:     31900m
 memory:  131633084Ki
 pods:    110
System Info:
 Machine ID:                 0ff6f3b9214b38ad07c063d45a6a5175
 System UUID:                4C4C4544-0044-5710-8037-B3C04F525631
 Boot ID:                    4d7ec0fc-428f-4b4c-aaae-8e70f374fbb1
 Kernel Version:             4.4.0-87-generic
 OS Image:                   Ubuntu 16.04.3 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://17.3.1
 Kubelet Version:            v1.9.2+coreos.0
 Kube-Proxy Version:         v1.9.2+coreos.0
ExternalID:                  kubeworker-rwva1-prod-11
Non-terminated Pods:         (6 in total)
  Namespace                  Name                                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                                         ------------  ----------  ---------------  -------------
  ingress-nginx-internal     default-http-backend-internal-7c8ff87c86-955np               10m (0%)      10m (0%)    20Mi (0%)        20Mi (0%)
  kube-system                calico-node-8fzk6                                            150m (0%)     300m (0%)   64M (0%)         500M (0%)
  kube-system                kube-proxy-kubeworker-rwva1-prod-11                          150m (0%)     500m (1%)   64M (0%)         2G (1%)
  kube-system                nginx-proxy-kubeworker-rwva1-prod-11                         25m (0%)      300m (0%)   32M (0%)         512M (0%)
  prometheus                 prometheus-prometheus-kube-state-metrics-7c5cbb6f55-jq97n    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  prometheus                 prometheus-prometheus-node-exporter-7gn2x                    0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  335m (1%)     1110m (3%)  176730Ki (0%)    3032971520 (2%)
Events:         <none>

What's going on?

➜  k logs -f -n prometheus prometheus-prometheus-node-exporter-gckzj

    Error from server: Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on 10.0.0.3:53: no such host

    ➜  cat /etc/hosts | head -n1
    10.0.0.111 kubeworker-rwva1-prod-14

ubuntu@kubemaster-rwva1-prod-1:~$ ping kubeworker-rwva1-prod-14
PING kubeworker-rwva1-prod-14 (10.0.0.111) 56(84) bytes of data.
64 bytes from kubeworker-rwva1-prod-14 (10.0.0.111): icmp_seq=1 ttl=64 time=0.275 ms
^C
--- kubeworker-rwva1-prod-14 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.275/0.275/0.275/0.000 ms

ubuntu@kubemaster-rwva1-prod-1:~$ kubectl logs -f -n prometheus prometheus-prometheus-node-exporter-gckzj
Error from server: Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on 10.0.0.3:53: no such host

Solution

  • Insane problem. I don't know exactly how I fixed this, but I put it back together by deleting one of my non-functional nodes and re-registering it under its full FQDN. This somehow fixed everything. I was then able to delete the FQDN-registered node and recreate it with the short name.

    After a lot of tcpdump work, the best explanation I can come up with is that the error message was accurate, but in a really stupid and confusing way.

    {"kind":"Pod","apiVersion":"v1","metadata":{"name":"prometheus-prometheus-node-exporter-gckzj","generateName":"prometheus-prometheus-node-exporter-","namespace":"prometheus","selfLink":"/api/v1/namespaces/prometheus/pods/prometheus-prometheus-node-exporter-gckzj","uid":"2fa4b744-8a33-11e8-9b15-bc305bef2e18","resourceVersion":"37138627","creationTimestamp":"2018-07-18T02:35:08Z","labels":{"app":"prometheus","component":"node-exporter","controller-revision-hash":"1725903292","pod-template-generation":"1","release":"prometheus"},"ownerReferences":[{"apiVersion":"extensions/v1beta1","kind":"DaemonSet","name":"prometheus-prometheus-node-exporter","uid":"e9216885-1616-11e8-b853-d4ae528b79ed","controller":true,"blockOwnerDeletion":true}]},"spec":{"volumes":[{"name":"proc","hostPath":{"path":"/proc","type":""}},{"name":"sys","hostPath":{"path":"/sys","type":""}},{"name":"prometheus-prometheus-node-exporter-token-zvrdk","secret":{"secretName":"prometheus-prometheus-node-exporter-token-zvrdk","defaultMode":420}}],"containers":[{"name":"prometheus-node-exporter","image":"prom/node-exporter:v0.15.2","args":["--path.procfs=/host/proc","--path.sysfs=/host/sys"],"ports":[{"name":"metrics","hostPort":9100,"containerPort":9100,"protocol":"TCP"}],"resources":{},"volumeMounts":[{"name":"proc","readOnly":true,"mountPath":"/host/proc"},{"name":"sys","readOnly":true,"mountPath":"/host/sys"},{"name":"prometheus-prometheus-node-exporter-token-zvrdk","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","serviceAccountName":"prometheus-prometheus-node-exporter","serviceAccount":"prometheus-prometheus-node-exporter","nodeName":"kubeworker-rwva1-prod-14","hostNetwork":true,"hostPID":true,"securityContext":{},"schedulerName":"default-scheduler","tolerations":[{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute"},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute"},{"key":"node.kubernetes.io/disk-pressure","operator":"Exists","effect":"NoSchedule"},{"key":"node.kubernetes.io/memory-pressure","operator":"Exists","effect":"NoSchedule"}]},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-07-18T02:35:13Z"},{"type":"Ready","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-07-20T08:02:58Z"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-07-18T02:35:14Z"}],"hostIP":"10.0.0.111","podIP":"10.0.0.111","startTime":"2018-07-18T02:35:13Z","containerStatuses":[{"name":"prometheus-node-exporter","state":{"running":{"startedAt":"2018-07-20T08:02:58Z"}},"lastState":{"terminated":{"exitCode":143,"reason":"Error","startedAt":"2018-07-20T08:02:27Z","finishedAt":"2018-07-20T08:02:39Z","containerID":"docker://db44927ad64eb130a73bee3c7b250f55ad911584415c373d3e3fa0fc838c146e"}},"ready":true,"restartCount":2,"image":"prom/node-exporter:v0.15.2","imageID":"docker-pullable://prom/node-exporter@sha256:6965ed8f31c5edba19d269d10238f59624e6b004f650ce925b3408ce222f9e49","containerID":"docker://4743ad5c5e60c31077e57d51eb522270c96ed227bab6522b4fcde826c4abc064"}],"qosClass":"BestEffort"}}
    {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on 10.0.0.3:53: no such host","code":500}
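
The message in the Status object above carries more detail than it first appears to. "Error from server" and the 500 code mean the API server itself failed while dialing the kubelet on port 10250, and the "lookup ... on 10.0.0.3:53" part names the DNS server that was asked. The following is a minimal Python sketch, not part of the original debugging, that pulls those fields out of such a message. The names LookupFailure and parse_lookup_failure are made up for illustration, and the pattern is only meant to fit messages shaped like the one above.

```python
import re
from typing import NamedTuple, Optional


class LookupFailure(NamedTuple):
    """Fields recovered from a 'dial tcp: lookup ... no such host' message."""
    target_host: str   # host name the API server tried to dial
    target_port: int   # port it tried to reach (10250 is the kubelet port)
    lookup_name: str   # name handed to the resolver
    dns_server: str    # DNS server that was actually queried
    dns_port: int


# Illustrative pattern; it assumes the message layout shown in this thread.
_PATTERN = re.compile(
    r"Get https://(?P<host>[^:/\s]+):(?P<port>\d+)/\S*: "
    r"dial tcp: lookup (?P<name>\S+) on (?P<dns>[^:\s]+):(?P<dns_port>\d+): "
    r"no such host"
)


def parse_lookup_failure(message: str) -> Optional[LookupFailure]:
    """Return the parsed fields, or None if the message has another shape."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return LookupFailure(
        target_host=match.group("host"),
        target_port=int(match.group("port")),
        lookup_name=match.group("name"),
        dns_server=match.group("dns"),
        dns_port=int(match.group("dns_port")),
    )
```

Run against the message above, this should report the short node name as the lookup target and 10.0.0.3 as the server that was queried, which is the detail that points at DNS on the API server side rather than at the machine running kubectl.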
    

    The cluster's internal DNS was not able to properly read the API to generate the necessary records. Without a name that the internal DNS was authoritative for, the cluster fell back to my upstream DNS server and tried to resolve the name recursively. The upstream DNS server didn't know what to do with the short name, which has no TLD suffix.
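
One way to see why ping can succeed while the lookup in the error fails is to compare the two resolution paths directly. The sketch below is an illustration under stated assumptions, not what was run during the original investigation. It uses only the Python standard library, the function names are hypothetical, and it sends the bare name exactly as given, whereas a real resolver may also try search domains from resolv.conf. system_resolves goes through the operating system's resolver, which typically consults /etc/hosts first; ask_dns_server sends a single A-record query straight to one DNS server and returns the response code.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS A-record query for `name` (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def rcode(response: bytes) -> int:
    """Return the RCODE from a DNS response header (0 = no error, 3 = NXDOMAIN)."""
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F


def ask_dns_server(name: str, server: str, port: int = 53,
                   timeout: float = 2.0) -> int:
    """Send one query for `name` to `server` over UDP and return the RCODE."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, port))
        response, _ = sock.recvfrom(512)
    return rcode(response)


def system_resolves(name: str) -> bool:
    """True if the operating system's resolver can resolve `name`."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


# Example calls (not executed here; the values come from the thread):
#   system_resolves("kubeworker-rwva1-prod-14")
#   ask_dns_server("kubeworker-rwva1-prod-14", "10.0.0.3")
```

If system_resolves returns True on a machine with an /etc/hosts entry while ask_dns_server reports a non-zero code for the same short name, that would be consistent with the explanation above: the name is known locally, but the DNS server named in the error has no record for it.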