I am struggling with a Grafana variable query in the dashboard configuration. The query variable should return the number of nodes that have joined the swarm, but it does not. In my case, I have only one swarm node, but the variable in Grafana returns up to 5 nodes. I really don't understand what causes the error.
Here is the situation: I set up Docker Swarm on my laptop as a manager. Only my laptop is in swarm mode; no other nodes have joined.
I used the source from https://github.com/stefanprodan/swarmprom to monitor the host with node-exporter. I kept prometheus.yml as in the original.
When I execute the metric query in Prometheus, only one host is returned. This is correct because I have only one node. You can see the figure below.
But when I ran the query in Grafana, it returned 5 hosts. This is really strange: I don't know why I got 5 hosts when I have only one swarm node.
I checked the git repo again with play-with-docker, configuring one manager node and 2 client nodes. Everything worked fine. The query in Grafana returned 3 hosts.
Here is the query formula: label_values(node_uname_info{job="node-exporter"}, instance)
Thank you so much in advance for your support.
What you have faced is a consequence of the ephemeral nature of containers, one of the challenges in monitoring containerized applications. Before we go into any solution options, let us see ...
Prometheus is a time-series database. Once in a while it contacts its scraping targets and collects metrics. Those metrics are saved with a time stamp and a set of labels, one of which is the 'instance' label in question.
The instance label normally consists of an address (a host/domain name or an IP address) and a port that Prometheus uses to scrape metrics. In this example the instance address is an IP address, because the list of targets is obtained through a DNS server (dns_sd_configs in the job definition).
When you deployed the stack, Docker created at least one container for each service, including node-exporter and Prometheus. Soon after that, Prometheus started obtaining metrics from the node-exporter instance; however, after some time the node-exporter container was recreated. Either you updated it, or killed it, or it crashed - I can't know, but the key is that you had a new container. The new node-exporter container got a different IP address, and because of that, metrics from the new instance received a different 'instance' label.
Remember that Prometheus is a time-series database? You have not lost metrics from the instance that went offline; they're still in the database. It is just that at this point you started collecting node-exporter metrics with a different label set (a new IP address in the 'instance' label at least). When Grafana queries labels for you, it requests metrics from the period currently set on the dashboard. Since the period was 'today', you've seen instances that were present today. In other words, when you request a list of possible instance values, you receive a list of values for the whole period without any filtering for currently active instances.
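To make the effect concrete, here is a minimal Python sketch of it. It is a toy model, not how Prometheus actually stores or queries data: the sample history, the IP addresses, and the label_values_in_range helper are all made up for illustration.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    timestamp: int  # seconds since an arbitrary start (made-up values)
    instance: str   # "<ip>:<port>" of the scraped node-exporter container


# Hypothetical scrape history: the container was recreated twice,
# and each time it came back with a new IP address.
samples = [
    Sample(0, "10.0.1.5:9100"),
    Sample(60, "10.0.1.5:9100"),
    Sample(120, "10.0.1.9:9100"),   # first recreation
    Sample(180, "10.0.1.9:9100"),
    Sample(240, "10.0.1.14:9100"),  # second recreation
    Sample(300, "10.0.1.14:9100"),
]


def label_values_in_range(history, start, end):
    """Return every distinct 'instance' value seen between start and end."""
    return sorted({s.instance for s in history if start <= s.timestamp <= end})


# A long dashboard period sees every instance that ever reported in it.
whole_period = label_values_in_range(samples, 0, 300)
# A short period sees only the container that is reporting now.
last_minute = label_values_in_range(samples, 240, 300)

print(len(whole_period), "instances over the whole period")
print(len(last_minute), "instance in the last minute")
```

One physical node shows up as three 'instance' values over the long period, which is the same thing that happened with the 5 hosts in the question.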
You need to use some static label(s) for this task. An 'instance' or a 'pod_name' (K8s) label is a poor choice if you don't want to see dead instances in the list. Pick a label that represents the thing or unit you want to watch and stick to it. Since node-exporter is there to monitor node metrics, I think a host name label will do.
If you see no way to avoid using dynamic labels, you can use a short time range on the dashboard, so that the label_values() function does not return long-dead labels. You'd want to set the variable refresh option to 'On Time Range Change', so that you can use a short dashboard interval to see and pick currently active instances, and a long period for any other case.
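If you want to check outside Grafana which series exist in a short window, the Prometheus HTTP API has a series endpoint that accepts match[], start and end parameters. The Python sketch below only builds such a URL. The base URL, the 5-minute window and the series_query_url helper are my assumptions, and Grafana issues its own lookup for you, so treat this as an illustration of a lookup being bounded by a time range rather than as what Grafana does internally.

```python
import time
from urllib.parse import urlencode


def series_query_url(base_url, selector, window_seconds, now=None):
    """Build a series lookup URL limited to the last window_seconds."""
    now = time.time() if now is None else now
    params = urlencode({
        "match[]": selector,                 # series selector to look up
        "start": int(now - window_seconds),  # beginning of the window
        "end": int(now),                     # end of the window
    })
    return f"{base_url}/api/v1/series?{params}"


# Assumed Prometheus address; replace with your own.
url = series_query_url(
    "http://localhost:9090",
    'node_uname_info{job="node-exporter"}',
    window_seconds=300,
    now=1_700_000_000,  # fixed timestamp so the example is reproducible
)
print(url)
```

Opening the printed URL in a browser (or with curl) should list only the series that had samples in that window.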
As I said previously, using a host name label will be better in this case. The problem is that there is no such label in the metric in question. Checking the swarmprom repo, I found that this node-exporter was made to expose a host name via the node_meta metric. So it is possible to map a host name to one or more instances using chained variables.
Another problem is that this solution may require changes in panel queries. Since one host name can resolve into multiple instances, it is essential that panel queries use a regex match for the 'instance' label (that is, =~ instead of =).
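Here is a small Python sketch of why the equality matcher stops working once a variable holds several values. It assumes that a multi-value selection ends up in the query as one alternation such as (a|b), and that a regex matcher has to match the whole label value. The instance addresses and both helper functions are hypothetical, and Python's re module only stands in for the regex engine Prometheus uses.

```python
import re


def equals(label_value, matcher_value):
    """Behaves like instance="...": exact string comparison."""
    return label_value == matcher_value


def regex_matches(label_value, pattern):
    """Behaves like instance=~"...": the pattern must match the whole value."""
    return re.fullmatch(pattern, label_value) is not None


# Hypothetical instances that one host name resolved to.
instances = ["10.0.1.9:9100", "10.0.1.14:9100"]

# A multi-value selection rendered as a single alternation.
selection = "(" + "|".join(re.escape(i) for i in instances) + ")"

for inst in instances:
    print(inst, "=", equals(inst, selection), "=~", regex_matches(inst, selection))
```

With =, the whole alternation is compared as a literal string, so no instance matches; with =~, every selected instance matches.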
Here's how to do all this:
label_values(node_meta, node_name)
This one will be used as a selector on the dashboard.
label_values(node_meta{node_name="$hostname"}, instance)
This will return a set of 'instance' labels matching the selected 'hostname'. If you select all and update panel queries to support multi-value instance label, you will be able to view metrics from all container instances associated with the selected host name.
Replace instance= with instance=~ in the panel queries, then copy-paste the edited model in Grafana.
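If the dashboard has many panels, the replacement can be scripted. The Python sketch below assumes the exported model is a JSON object with a top-level "panels" list, each panel having "targets" with an "expr" string. The sample model and both helper functions are made up for illustration, and panels nested inside collapsed rows are not handled.

```python
import json
import re

# Matches instance="..." only; instance=~"...", instance!="..." and
# labels such as other_instance="..." are left alone.
EXACT_INSTANCE = re.compile(r'\binstance\s*=\s*"')


def use_regex_matcher(expr):
    """Turn instance="..." into instance=~"..." inside one query string."""
    return EXACT_INSTANCE.sub('instance=~"', expr)


def rewrite_model(model):
    """Rewrite every target expression of every top-level panel in place."""
    for panel in model.get("panels", []):
        for target in panel.get("targets", []):
            if "expr" in target:
                target["expr"] = use_regex_matcher(target["expr"])
    return model


# Hypothetical, heavily trimmed dashboard model.
model = {
    "panels": [
        {
            "title": "Uptime",
            "targets": [{"expr": 'node_boot_time{instance="$instance"}'}],
        },
        {
            "title": "Already fixed",
            "targets": [{"expr": 'up{instance=~"$instance"}'}],
        },
    ]
}

rewritten = rewrite_model(model)
print(json.dumps(rewritten, indent=2))
```

Review the output before pasting it back, since a plain text substitution cannot know whether a particular panel was meant to match a single instance.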