
k3s - Metrics server doesn't work for worker nodes


I deployed a k3s cluster on two Raspberry Pi 4 boards, one as the master and the other as a worker, using the install script k3s provides with the following options:

For the master node (192.168.1.113 is the master node's IP):

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='server --bind-address 192.168.1.113' sh -

For the agent node:

curl -sfL https://get.k3s.io | \
    K3S_URL=https://192.168.1.113:6443 \
    K3S_TOKEN=<master-token> \
    INSTALL_K3S_EXEC='agent' sh -

Everything seems to work, but kubectl top nodes returns the following:

NAME          CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
k3s-master    137m         3%          1285Mi          33%
k3s-node-01   <unknown>    <unknown>   <unknown>       <unknown>

I also tried to deploy the Kubernetes dashboard as described in the docs, but it fails because it can't reach the metrics server and gets a timeout error:

"error trying to reach service: dial tcp 10.42.1.11:8443: i/o timeout"

and I see a lot of errors in the pod logs:

2021/09/17 09:24:06 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2021/09/17 09:25:06 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2021/09/17 09:26:06 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.
2021/09/17 09:27:06 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.

logs from the metrics-server pod:

elet_summary:k3s-node-01: unable to fetch metrics from Kubelet k3s-node-01 (k3s-node-01): Get https://k3s-node-01:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.1.106:10250: connect: no route to host
E0917 14:03:24.767949       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:k3s-node-01: unable to fetch metrics from Kubelet k3s-node-01 (k3s-node-01): Get https://k3s-node-01:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.1.106:10250: connect: no route to host
E0917 14:04:24.767960       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:k3s-node-01: unable to fetch metrics from Kubelet k3s-node-01 (k3s-node-01): Get https://k3s-node-01:10250/stats/summary?only_cpu_and_memory=true: dial tcp 192.168.1.106:10250: connect: no route to host

Solution

  • Moving this out of comments for better visibility.


    After creating a small cluster, I wasn't able to reproduce this behaviour: metrics-server worked fine for both nodes, and kubectl top nodes showed metrics for both of them (though it took some time to start collecting the metrics).

    That leaves troubleshooting why it doesn't work in your cluster. Checking the metrics-server logs is the most efficient way to figure this out:

    $ kubectl logs metrics-server-58b44df574-2n9dn -n kube-system
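
The raw log repeats the same long line every minute, so it can help to boil it down to one line per node and failure reason. The following is a minimal sketch, not part of the original answer: the function name summarize_scrape_errors is hypothetical, and the pattern assumes the log line format shown in the question (the reason is whatever follows the last colon).

```shell
# Summarize metrics-server scrape failures as "node: reason", one line per
# distinct pair. Reads log text on stdin, so it works on saved logs too.
# summarize_scrape_errors is a hypothetical helper name.
summarize_scrape_errors() {
  grep 'unable to fetch metrics from Kubelet' \
    | sed -E 's/.*unable to fetch metrics from Kubelet ([^ ]+) .*: ([^:]+)$/\1: \2/' \
    | sort -u
}

# Usage against a live cluster. Assumes the Deployment is called
# metrics-server in kube-system, which matches the pod name above:
#   kubectl logs -n kube-system deployment/metrics-server | summarize_scrape_errors
```

Reading from the Deployment instead of a specific pod avoids having to look up the generated pod name first.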
    

    The next steps depend on what the logs show. For instance, from the comments above:

    • First the error was no route to host, which points to a network problem: the metrics-server pod could not reach the address the node's hostname resolved to (a wrong or unresolvable hostname, missing route, or a firewall rejecting the connection).
    • Then it became i/o timeout, which means a route exists but the service did not respond. This can happen when a firewall blocks certain ports or sources, when the kubelet (which listens on port 10250) is not running, or, as it turned out for the OP, when an NTP problem breaks certificates and connections.
    • Errors may be different in other cases. The important part is to find the actual error and troubleshoot further based on it.
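
The triage above can be sketched as a small lookup from error text to the next thing to check. This is only an illustration under assumptions: next_check is a hypothetical helper, the two patterns are the errors seen in this question, and the commands in the comments (nc, systemctl, timedatectl) assume a systemd-based node with netcat installed and the default k3s-agent service name.

```shell
# Map an error fragment from the metrics-server log to a suggested next check.
# next_check is a hypothetical helper; extend the patterns for other errors.
next_check() {
  case "$1" in
    *"no route to host"*)
      echo "network: verify the node IP, routing, and firewall reject rules" ;;
    *"i/o timeout"*)
      echo "timeout: verify port 10250 is open, the kubelet is running, and clocks are in sync" ;;
    *)
      echo "other: read the full log line and troubleshoot from there" ;;
  esac
}

# Commands that may help with each check (run where noted):
#   nc -vz -w 3 192.168.1.106 10250   # from the master: is the kubelet port reachable?
#   systemctl status k3s-agent        # on the worker: is the agent (and its kubelet) running?
#   timedatectl                       # on both nodes: is the system clock synchronized?
```

For example, next_check "connect: no route to host" prints the network hint, so the firewall and routing between the two Raspberry Pis would be the first thing to look at.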