I am using Kubespray with Kubernetes 1.9
What I'm seeing is the following when I try to interact with pods on my new nodes in anyway through kubectl. Important to note that the nodes are considered to be healthy and are having pods scheduled on them appropriately. The pods are totally functional.
➜ Scripts k logs -f -n prometheus prometheus-prometheus-node-exporter-gckzj
Error from server: Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on 10.0.0.3:53: no such host
I am able to ping to the kubeworker nodes both locally where I am running kubectl and from all masters by both IP and DNS.
➜ Scripts ping kubeworker-rwva1-prod-14
PING kubeworker-rwva1-prod-14 (10.0.0.111): 56 data bytes
64 bytes from 10.0.0.111: icmp_seq=0 ttl=63 time=88.972 ms
^C
pubuntu@kubemaster-rwva1-prod-1:~$ ping kubeworker-rwva1-prod-14
PING kubeworker-rwva1-prod-14 (10.0.0.111) 56(84) bytes of data.
64 bytes from kubeworker-rwva1-prod-14 (10.0.0.111): icmp_seq=1 ttl=64 time=0.259 ms
64 bytes from kubeworker-rwva1-prod-14 (10.0.0.111): icmp_seq=2 ttl=64 time=0.213 ms
➜ Scripts k get nodes
NAME STATUS ROLES AGE VERSION
kubemaster-rwva1-prod-1 Ready master 174d v1.9.2+coreos.0
kubemaster-rwva1-prod-2 Ready master 174d v1.9.2+coreos.0
kubemaster-rwva1-prod-3 Ready master 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-1 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-10 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-11 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-12 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-14 Ready node 16d v1.9.2+coreos.0
kubeworker-rwva1-prod-15 Ready node 14d v1.9.2+coreos.0
kubeworker-rwva1-prod-16 Ready node 6d v1.9.2+coreos.0
kubeworker-rwva1-prod-17 Ready node 4d v1.9.2+coreos.0
kubeworker-rwva1-prod-18 Ready node 4d v1.9.2+coreos.0
kubeworker-rwva1-prod-19 Ready node 6d v1.9.2+coreos.0
kubeworker-rwva1-prod-2 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-20 Ready node 6d v1.9.2+coreos.0
kubeworker-rwva1-prod-21 Ready node 6d v1.9.2+coreos.0
kubeworker-rwva1-prod-3 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-4 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-5 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-6 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-7 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-8 Ready node 174d v1.9.2+coreos.0
kubeworker-rwva1-prod-9 Ready node 174d v1.9.2+coreos.0
When I describe a broken node, it looks identical to one of my functioning ones.
➜ Scripts k describe node kubeworker-rwva1-prod-14
Name: kubeworker-rwva1-prod-14
Roles: node
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=kubeworker-rwva1-prod-14
node-role.kubernetes.io/node=true
role=app-tier
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: <none>
CreationTimestamp: Tue, 17 Jul 2018 19:35:08 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Fri, 03 Aug 2018 18:44:59 -0700 Tue, 17 Jul 2018 19:35:08 -0700 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Fri, 03 Aug 2018 18:44:59 -0700 Tue, 17 Jul 2018 19:35:08 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 03 Aug 2018 18:44:59 -0700 Tue, 17 Jul 2018 19:35:08 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Fri, 03 Aug 2018 18:44:59 -0700 Tue, 17 Jul 2018 19:35:18 -0700 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.0.0.111
Hostname: kubeworker-rwva1-prod-14
Capacity:
cpu: 32
memory: 147701524Ki
pods: 110
Allocatable:
cpu: 31900m
memory: 147349124Ki
pods: 110
System Info:
Machine ID: da30025a3f8fd6c3f4de778c5b4cf558
System UUID: 5ACCBB64-2533-E611-97F0-0894EF1D343B
Boot ID: 6b42ba3e-36c4-4520-97e6-e7c6fed195e2
Kernel Version: 4.4.0-130-generic
OS Image: Ubuntu 16.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.3.1
Kubelet Version: v1.9.2+coreos.0
Kube-Proxy Version: v1.9.2+coreos.0
ExternalID: kubeworker-rwva1-prod-14
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system calico-node-cd7qg 150m (0%) 300m (0%) 64M (0%) 500M (0%)
kube-system kube-proxy-kubeworker-rwva1-prod-14 150m (0%) 500m (1%) 64M (0%) 2G (1%)
kube-system nginx-proxy-kubeworker-rwva1-prod-14 25m (0%) 300m (0%) 32M (0%) 512M (0%)
prometheus prometheus-prometheus-node-exporter-gckzj 0 (0%) 0 (0%) 0 (0%) 0 (0%)
rabbit-relay rabbit-relay-844d6865c7-q6fr2 0 (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
325m (1%) 1100m (3%) 160M (0%) 3012M (1%)
Events: <none>
➜ Scripts k describe node kubeworker-rwva1-prod-11
Name: kubeworker-rwva1-prod-11
Roles: node
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=kubeworker-rwva1-prod-11
node-role.kubernetes.io/node=true
role=test
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: <none>
CreationTimestamp: Fri, 09 Feb 2018 21:03:46 -0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Fri, 03 Aug 2018 18:46:31 -0700 Fri, 09 Feb 2018 21:03:38 -0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Fri, 03 Aug 2018 18:46:31 -0700 Mon, 16 Jul 2018 13:24:58 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 03 Aug 2018 18:46:31 -0700 Mon, 16 Jul 2018 13:24:58 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Fri, 03 Aug 2018 18:46:31 -0700 Mon, 16 Jul 2018 13:24:58 -0700 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.0.0.218
Hostname: kubeworker-rwva1-prod-11
Capacity:
cpu: 32
memory: 131985484Ki
pods: 110
Allocatable:
cpu: 31900m
memory: 131633084Ki
pods: 110
System Info:
Machine ID: 0ff6f3b9214b38ad07c063d45a6a5175
System UUID: 4C4C4544-0044-5710-8037-B3C04F525631
Boot ID: 4d7ec0fc-428f-4b4c-aaae-8e70f374fbb1
Kernel Version: 4.4.0-87-generic
OS Image: Ubuntu 16.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.3.1
Kubelet Version: v1.9.2+coreos.0
Kube-Proxy Version: v1.9.2+coreos.0
ExternalID: kubeworker-rwva1-prod-11
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
ingress-nginx-internal default-http-backend-internal-7c8ff87c86-955np 10m (0%) 10m (0%) 20Mi (0%) 20Mi (0%)
kube-system calico-node-8fzk6 150m (0%) 300m (0%) 64M (0%) 500M (0%)
kube-system kube-proxy-kubeworker-rwva1-prod-11 150m (0%) 500m (1%) 64M (0%) 2G (1%)
kube-system nginx-proxy-kubeworker-rwva1-prod-11 25m (0%) 300m (0%) 32M (0%) 512M (0%)
prometheus prometheus-prometheus-kube-state-metrics-7c5cbb6f55-jq97n 0 (0%) 0 (0%) 0 (0%) 0 (0%)
prometheus prometheus-prometheus-node-exporter-7gn2x 0 (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
335m (1%) 1110m (3%) 176730Ki (0%) 3032971520 (2%)
Events: <none>
What's going on?
➜ k logs -f -n prometheus prometheus-prometheus-node-exporter-gckzj
Error from server: Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on 10.0.0.3:53: no such host
➜ cat /etc/hosts | head -n1
10.0.0.111 kubeworker-rwva1-prod-14
ubuntu@kubemaster-rwva1-prod-1:~$ ping kubeworker-rwva1-prod-14
PING kubeworker-rwva1-prod-14 (10.0.0.111) 56(84) bytes of data.
64 bytes from kubeworker-rwva1-prod-14 (10.0.0.111): icmp_seq=1 ttl=64 time=0.275 ms
^C
--- kubeworker-rwva1-prod-14 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.275/0.275/0.275/0.000 ms
ubuntu@kubemaster-rwva1-prod-1:~$ kubectl logs -f -n prometheus prometheus-prometheus-node-exporter-gckzj
Error from server: Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on 10.0.0.3:53: no such host
Insane problem. I don't know exactly how I fixed this. But I somehow put it back together by deleting one of my non functional nodes and re-registering it with the full FQDN. This somehow fixed everything. I was then able to delete the FQDN registered node and recreate it the short name.
After a lot of TCPdumping the best explanation I can come up with is the error message was accurate but in a really stupid and confusing way.
{"kind":"Pod","apiVersion":"v1","metadata":{"name":"prometheus-prometheus-node-exporter-gckzj","generateName":"prometheus-prometheus-node-exporter-","namespace":"prometheus","selfLink":"/api/v1/namespaces/prometheus/pods/prometheus-prometheus-node-exporter-gckzj","uid":"2fa4b744-8a33-11e8-9b15-bc305bef2e18","resourceVersion":"37138627","creationTimestamp":"2018-07-18T02:35:08Z","labels":{"app":"prometheus","component":"node-exporter","controller-revision-hash":"1725903292","pod-template-generation":"1","release":"prometheus"},"ownerReferences":[{"apiVersion":"extensions/v1beta1","kind":"DaemonSet","name":"prometheus-prometheus-node-exporter","uid":"e9216885-1616-11e8-b853-d4ae528b79ed","controller":true,"blockOwnerDeletion":true}]},"spec":{"volumes":[{"name":"proc","hostPath":{"path":"/proc","type":""}},{"name":"sys","hostPath":{"path":"/sys","type":""}},{"name":"prometheus-prometheus-node-exporter-token-zvrdk","secret":{"secretName":"prometheus-prometheus-node-exporter-token-zvrdk","defaultMode":420}}],"containers":[{"name":"prometheus-node-exporter","image":"prom/node-exporter:v0.15.2","args":["--path.procfs=/host/proc","--path.sysfs=/host/sys"],"ports":[{"name":"metrics","hostPort":9100,"containerPort":9100,"protocol":"TCP"}],"resources":{},"volumeMounts":[{"name":"proc","readOnly":true,"mountPath":"/host/proc"},{"name":"sys","readOnly":true,"mountPath":"/host/sys"},{"name":"prometheus-prometheus-node-exporter-token-zvrdk","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","serviceAccountName":"prometheus-prometheus-node-exporter","serviceAccount":"prometheus-prometheus-node-exporter","nodeName":"kubeworker-rwva1-prod-14","hostNetwork":true,"hostPID":true,"securityContext":{},"schedulerName":"default-scheduler","tolerations":[{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute"},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute"},{"key":"node.kubernetes.io/disk-pressure","operator":"Exists","effect":"NoSchedule"},{"key":"node.kubernetes.io/memory-pressure","operator":"Exists","effect":"NoSchedule"}]},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-07-18T02:35:13Z"},{"type":"Ready","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-07-20T08:02:58Z"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2018-07-18T02:35:14Z"}],"hostIP":"10.0.0.111","podIP":"10.0.0.111","startTime":"2018-07-18T02:35:13Z","containerStatuses":[{"name":"prometheus-node-exporter","state":{"running":{"startedAt":"2018-07-20T08:02:58Z"}},"lastState":{"terminated":{"exitCode":143,"reason":"Error","startedAt":"2018-07-20T08:02:27Z","finishedAt":"2018-07-20T08:02:39Z","containerID":"docker://db44927ad64eb130a73bee3c7b250f55ad911584415c373d3e3fa0fc838c146e"}},"ready":true,"restartCount":2,"image":"prom/node-exporter:v0.15.2","imageID":"docker-pullable://prom/node-exporter@sha256:6965ed8f31c5edba19d269d10238f59624e6b004f650ce925b3408ce222f9e49","containerID":"docker://4743ad5c5e60c31077e57d51eb522270c96ed227bab6522b4fcde826c4abc064"}],"qosClass":"BestEffort"}}
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Get https://kubeworker-rwva1-prod-14:10250/containerLogs/prometheus/prometheus-prometheus-node-exporter-gckzj/prometheus-node-exporter?follow=true: dial tcp: lookup kubeworker-rwva1-prod-14 on 10.0.0.3:53: no such host","code":500}
The internal DNS of the cluster was not able to properly read the API to generate the necessary records. Without a name that the DNS was authoritative for, the cluster was using my upstream DNS records to attempt to recursively resolve the name. The upstream DNS server didn't know what to do with the short form name without a tld suffix.