node-exporter targets not showing in prometheus UI

I have a Kubernetes cluster set up using kubeadm. I installed prometheus and node-exporter on top of it based on:

The pods seem to be running properly:

 kubectl get pods --namespace=monitoring -o wide
NAME                                     READY   STATUS    RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
node-exporter-jk2sd                      1/1     Running   0          90m   work03   <none>           <none>
node-exporter-jldrx                      1/1     Running   0          90m   work04   <none>           <none>
node-exporter-mgtld                      1/1     Running   0          90m   work01   <none>           <none>
node-exporter-tq7bx                      1/1     Running   0          90m   work02   <none>           <none>
prometheus-deployment-5d79b5f65b-tkpd2   1/1     Running   0          91m   work02   <none>           <none>

I can see the endpoints, as well:

kubectl get endpoints -n monitoring
NAME            ENDPOINTS                                                           AGE
node-exporter,, + 1 more...   5m3s

I also ran kubectl port-forward prometheus-deployment-5d79b5f65b-tkpd2 8080:9090 -n monitoring, but when I open the Prometheus web UI and go to Status > Targets, the node-exporters are not listed. When I start typing a query for a metric reported by node-exporter, it is not suggested in the query editor either.

Logs coming from the prometheus pod seem to have a lot of errors:

kubectl logs prometheus-deployment-5d79b5f65b-tkpd2 -n monitoring
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:428 msg="Starting Prometheus" version="(version=2.29.1, branch=HEAD, revision=dcb07e8eac34b5ea37cd229545000b857f1c1637)"
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:433 build_context="(go=go1.16.7, user=root@364730518a4e, date=20210811-14:48:27)"
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:434 host_details="(Linux 5.4.0-70-generic #78-Ubuntu SMP Fri Mar 19 13:29:52 UTC 2021 x86_64 prometheus-deployment-5d79b5f65b-tkpd2 (none))"
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:435 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-08-11T16:24:21.743Z caller=main.go:436 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-08-11T16:24:21.745Z caller=web.go:541 component=web msg="Start listening for connections" address=
level=info ts=2021-08-11T16:24:21.745Z caller=main.go:812 msg="Starting TSDB ..."
level=info ts=2021-08-11T16:24:21.748Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2021-08-11T16:24:21.753Z caller=head.go:815 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2021-08-11T16:24:21.753Z caller=head.go:829 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=4.15µs
level=info ts=2021-08-11T16:24:21.753Z caller=head.go:835 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2021-08-11T16:24:21.754Z caller=head.go:892 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2021-08-11T16:24:21.754Z caller=head.go:898 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=75.316µs wal_replay_duration=451.769µs total_replay_duration=566.051µs
level=info ts=2021-08-11T16:24:21.756Z caller=main.go:839 fs_type=EXT4_SUPER_MAGIC
level=info ts=2021-08-11T16:24:21.756Z caller=main.go:842 msg="TSDB started"
level=info ts=2021-08-11T16:24:21.756Z caller=main.go:969 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2021-08-11T16:24:21.757Z caller=kubernetes.go:282 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-08-11T16:24:21.759Z caller=kubernetes.go:282 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-08-11T16:24:21.762Z caller=kubernetes.go:282 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-08-11T16:24:21.764Z caller=main.go:1006 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=7.940972ms db_storage=607ns remote_storage=1.251µs web_handler=283ns query_engine=694ns scrape=227.668µs scrape_sd=6.081132ms notify=27.11µs notify_sd=16.477µs rules=648.58µs
level=info ts=2021-08-11T16:24:21.764Z caller=main.go:784 msg="Server is ready to receive web requests."
level=error ts=2021-08-11T16:24:51.765Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:24:51.765Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:24:51.765Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Service: failed to list *v1.Service: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:24:51.766Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:24:51.766Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Node: failed to list *v1.Node: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:25:22.587Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Service: failed to list *v1.Service: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:25:22.855Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:25:23.153Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:25:23.261Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:25:23.335Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Node: failed to list *v1.Node: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:25:54.814Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:25:55.282Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Node: failed to list *v1.Node: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:25:55.516Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Service: failed to list *v1.Service: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:25:55.934Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:25:56.442Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:26:30.058Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:26:30.204Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:26:30.246Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Service: failed to list *v1.Service: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:26:30.879Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:26:31.479Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Node: failed to list *v1.Node: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:27:09.673Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:27:09.835Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Service: failed to list *v1.Service: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:27:10.467Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:27:11.170Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:27:12.684Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Node: failed to list *v1.Node: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:27:55.324Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Service: failed to list *v1.Service: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:28:01.550Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:28:01.621Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:28:04.801Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:28:05.598Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Node: failed to list *v1.Node: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:28:57.256Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"
level=error ts=2021-08-11T16:29:04.688Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/ Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"\": dial tcp i/o timeout"

Is there a way to solve this issue and make node-exporters show up in the targets?

Version details:

kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.9", GitCommit:"7a576bc3935a6b555e33346fd73ad77c925e9e4a", GitTreeState:"clean", BuildDate:"2021-07-15T20:56:38Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}

Edit: The cluster was set up as follows:

sudo kubeadm reset
sudo rm $HOME/.kube/config
sudo kubeadm init --pod-network-cidr=
mkdir -p $HOME/.kube; sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config; sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl apply -f

It is using flannel.

flannel pods are running:

kube-flannel-ds-45qwf                1/1     Running   0          31h   x.x.x.41   work01   <none>           <none>
kube-flannel-ds-4rwzj                1/1     Running   0          31h   x.x.x.40   mast01   <none>           <none>
kube-flannel-ds-8fdtt                1/1     Running   24         31h   x.x.x.43   work03   <none>           <none>
kube-flannel-ds-8hl5f                1/1     Running   23         31h   x.x.x.44   work04   <none>           <none>
kube-flannel-ds-xqtrd                1/1     Running   0          31h   x.x.x.42   work02   <none>           <none>


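  • The issue is that the SDN (the pod network) is not working properly: the log shows Prometheus timing out whenever it tries to list Pods, Endpoints, Services and Nodes for service discovery.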

    As a general rule, when troubleshooting this, we would check the SDN pods (calico, weave, or in this case flannel): are they healthy, are there any errors in their logs, and so on.

    Check the iptables (iptables -nL) and IPVS (ipvsadm -l -n) configuration on the nodes.

    Restart the SDN pods, as well as kube-proxy, if you still haven't found anything.
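
The generic checks above could be scripted roughly as follows. This is only a sketch under assumptions: flannel and kube-proxy are taken to run as DaemonSets named kube-flannel-ds and kube-proxy in the kube-system namespace, the iptables and ipvsadm commands are meant to be run as root on the node being inspected, and NS, run, DRY_RUN, check_sdn and restart_sdn are illustrative names rather than part of any tool.

```shell
#!/bin/sh
# Namespace holding the SDN and kube-proxy pods (assumption: kube-system;
# some flannel manifests use a different namespace, so override NS if needed).
NS="${NS:-kube-system}"

# Print each command; execute it only when DRY_RUN is unset or empty.
run() {
  echo "+ $*"
  if [ -z "${DRY_RUN:-}" ]; then "$@"; fi
}

# Health, logs, and packet-forwarding rules.
check_sdn() {
  run kubectl -n "$NS" get pods -o wide                          # status and restart counts
  run kubectl -n "$NS" logs daemonset/kube-flannel-ds --tail=50  # logs of one flannel pod
  run iptables -nL                                               # run on the node itself
  run ipvsadm -ln                                                # only meaningful in IPVS mode
}

# Last resort: recreate the SDN and kube-proxy pods.
restart_sdn() {
  run kubectl -n "$NS" rollout restart daemonset/kube-flannel-ds
  run kubectl -n "$NS" rollout restart daemonset/kube-proxy
}
```

Setting DRY_RUN=1 before calling check_sdn or restart_sdn only prints the commands, which is a cheap way to review them before touching the cluster.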

    Now, in this specific case, we are not dealing with an outage: the cluster is freshly deployed, so it is likely the SDN never worked at all. This may not be obvious with a kubeadm deployment, which ships only the default pods, most of which use host networking.

    The kubeadm init command sets a pod CIDR with --pod-network-cidr, which prompts two remarks:

    • with any SDN: the pod CIDR is a subnet that gets split into smaller subnets (usually /24 or /25), each range being statically allocated to a node when it first joins your cluster.

    • with the flannel SDN: kubeadm init should include a --pod-network-cidr argument that MUST match the subnet configured in the kube-flannel-cfg ConfigMap (see the net-conf.json key).
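
One way to check both remarks is sketched below. It assumes the flannel ConfigMap lives in the kube-system namespace and that the cluster still has the kubeadm-config ConfigMap written by kubeadm init; the function names are made up for illustration. The two small helpers are pure text processing, while compare_live and node_pod_cidrs need kubectl access to the cluster.

```shell
#!/bin/sh
# Pull the "Network" value out of flannel's net-conf.json (read from stdin).
flannel_network() {
  sed -n 's/.*"Network"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeed only when the two CIDR strings are identical.
cidr_matches() {
  [ "$1" = "$2" ]
}

# Compare flannel's Network with the podSubnet recorded by kubeadm.
# Assumption: both ConfigMaps are in kube-system.
compare_live() {
  net=$(kubectl -n kube-system get configmap kube-flannel-cfg \
          -o jsonpath='{.data.net-conf\.json}' | flannel_network)
  pod=$(kubectl -n kube-system get configmap kubeadm-config \
          -o jsonpath='{.data.ClusterConfiguration}' | awk '/podSubnet:/ {print $2}')
  echo "flannel Network: $net, kubeadm podSubnet: $pod"
  cidr_matches "$net" "$pod"
}

# List the per-node range carved out of the pod CIDR.
node_pod_cidrs() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
}
```

If compare_live returns non-zero, or node_pod_cidrs prints empty ranges, the pod CIDR given to kubeadm init and the flannel configuration most likely disagree.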

    Though I'm unfamiliar with the process of fixing this, there seems to be an answer on ServerFault with instructions that sound right: