Tags: docker, prometheus, grafana, docker-swarm, prometheus-node-exporter

Error query in Grafana/Prometheus with node-exporter in docker swarm mode


I am struggling with a variable query problem in a Grafana dashboard configuration. The query variable should return the nodes that have joined the swarm, but it does not. In my case I have only one swarm node, yet the variable in Grafana returns up to 5 nodes. I really don't understand what causes the error.

Here is the situation: I set up docker swarm on my laptop as a manager; only my laptop is in swarm mode, no other nodes have joined. I used the source from https://github.com/stefanprodan/swarmprom to monitor the host with node-exporter. I kept the prometheus.yml unchanged.

When I execute the metric query in Prometheus, only one host is returned. This is correct because I have only one node. You can see the figure below.

[screenshot: Prometheus query returning a single instance]

But when I did the query in Grafana, it returned 5 hosts. This was really strange; I don't know why I got 5 hosts when I had only one swarm node.

[screenshot: Grafana variable query returning 5 instances]

I checked the git repo again with play-with-docker, configuring one manager node and 2 worker nodes. Everything worked fine: the query in Grafana returned 3 hosts.

[screenshot: Grafana variable query returning 3 instances]

Here is the query formula:

label_values(node_uname_info{job="node-exporter"}, instance)

Thank you so much for your support in advance.


Solution

  • What you have faced is a consequence of the ephemeral nature of containers, one of the challenges of monitoring containerized applications. Before we go into any solution options, let us see ...

    How it happened that Grafana shows more instances than there are.

    Prometheus is a time-series database. Once in a while it contacts its scraping targets and collects metrics. Those metrics are saved with a time stamp and a set of labels, one of which is the 'instance' label in question.

    The 'instance' label normally consists of an address (a host/domain name or an IP address) and a port that Prometheus uses to scrape metrics. In this example the instance address is an IP address, because the list of targets is obtained through a DNS server (dns_sd_configs in the job definition).
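    For reference, the relevant scrape job in swarmprom's prometheus.yml looks roughly like the sketch below; the service DNS name and port reflect that repo's defaults, so treat the exact values as assumptions rather than a verbatim copy:

    # sketch of a DNS-based service discovery job (names and port are
    # assumptions based on swarmprom's defaults)
    scrape_configs:
      - job_name: 'node-exporter'
        dns_sd_configs:
          - names:
              - 'tasks.node-exporter'   # resolves to one A record per running task
            type: 'A'
            port: 9100

    Each A record corresponds to one container's IP, so every node-exporter task ends up with an 'instance' label of the form <task-IP>:9100.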

    When you deployed the stack, docker created at least one container for each service, including node-exporter and prometheus. Soon after that, Prometheus started obtaining metrics from the node-exporter instance; however, after some time the node-exporter container was recreated. Either you updated it, or killed it, or it crashed - I can't know, but the key point is that you got a new container. The new node-exporter container got a different IP address, and because of that the metrics from the new instance received a different 'instance' label.

    Remember that Prometheus is a time-series database? You have not lost the metrics from the instance that went offline; they are still in the database. It is just that at this point you started collecting node-exporter metrics with a different label set (at the very least a new IP address in the 'instance' label). When Grafana queries labels for you, it requests metrics from the period currently set on the dashboard. Since the period was 'today', you saw the instances that were present at some point today. In other words, when you request a list of possible 'instance' values, you receive a list of values for the whole period, without any filtering for currently active instances.
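    You can observe the difference directly in Prometheus. As a rough illustration (the metric and job names match the query above; the 24h window is just an example):

    # instances Prometheus is scraping right now - one result per live target
    count by (instance) (node_uname_info{job="node-exporter"})

    # distinct 'instance' values seen at any point in the last 24 hours,
    # including containers that no longer exist
    count(count by (instance) (count_over_time(node_uname_info{job="node-exporter"}[24h])))

    The first (instant) query returns only the current container, while the second also counts every label set left behind by recreated containers - which is essentially what label_values() gives Grafana for the dashboard's time range.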

    General solution.

    You need to use some static label(s) for this task. The 'instance' or 'pod_name' (Kubernetes) labels are a poor choice if you don't want to see dead instances in the list. Pick a label that represents the thing or unit you want to watch and stick to it. Since node-exporter is there to monitor node metrics, I think a host name label will do.

    If you see no way to avoid using dynamic labels, you can use a short time range on the dashboard, so that the label_values() function does not return long-dead labels. You may want to set the variable refresh option to 'On Time Range Change', so that you can use a short dashboard interval to see and pick currently active instances, and a longer period for any other case.

    An option for this particular problem.

    As I said previously, using a host name label is a better fit in this case. The problem is that there is no such label in the metric in question. Checking the swarmprom repo, I found that this node-exporter is set up to expose the host name via the node_meta metric (here). So it is possible to map a host name to its instance(s) using chained variables.

    Another problem is that this solution may require changes to panel queries. Since one host name can resolve to multiple instances, it is essential that panel queries use a regex match for the 'instance' label (that is, =~ instead of =).

    Here's how to do all this:

    1. Create a new variable called 'hostname', set its refresh option to 'On Time Range Change', and use this for the query field:
    label_values(node_meta, node_name)
    

    This one will be used as a selector on the dashboard.

    2. Update the 'node' variable: set its refresh option to 'On Time Range Change', enable 'Multi-value' and the 'Include All option', and replace the query with this:
    label_values(node_meta{node_name="$hostname"}, instance)
    

    This will return the set of 'instance' labels matching the selected 'hostname'. If you select 'All' and update the panel queries to support a multi-value instance label, you will be able to view metrics from all container instances associated with the selected host name.

    3. Open the dashboard JSON model and copy it into your favourite text editor. Replace all occurrences of instance= with instance=~, then copy-paste the edited model back into Grafana.
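    For example, a hypothetical panel query would change like this (node_load1 is just an illustration; apply the same change to whatever metrics your panels actually use):

    # before: matches only one literal value of the 'node' variable
    node_load1{instance="$node"}

    # after: matches every instance selected in the multi-value 'node' variable
    node_load1{instance=~"$node"}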