I have a Docker swarm with three nodes: one manager and two workers. Each node runs node-exporter in a local container (not part of any swarm stack). The compose file used to create this container on each node separately is as follows:
version: '3'
services:
  node_exporter:
    image: quay.io/prometheus/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
    network_mode: host
    pid: host
    restart: unless-stopped
    volumes:
      - '/:/host:ro,rslave'
Prometheus runs in the swarm using a stack like:
version: '3'
services:
  prometheus:
    image: prom/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
    volumes:
      - /home/ubuntu/monitor/etc/prometheus:/etc/prometheus
      - /home/ubuntu/monitor/prometheus:/prometheus
    networks:
      - monitnw
networks:
  monitnw:
    external: true
Here, prometheus.yml includes a "job" that scrapes the metrics from node-exporter on the three nodes:
  - job_name: Node-exporter metrics
    static_configs:
      - targets: ['X.Y.Z.178:9100', 'X.Y.Z.218:9100', 'X.Y.Z.93:9100']
In all three nodes, the firewall is configured to allow all traffic from the other two peer nodes.
In Grafana, I can see that the metrics are reported correctly for the two peers of the node in which Prometheus is running, but there is no report for the node in which Prometheus is running. This is the same regardless of the node that runs Prometheus.
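One way to confirm this from Prometheus itself, rather than through Grafana, is to ask its HTTP API which targets it considers unhealthy. The sketch below is only an illustration: it assumes the API is reachable at `http://localhost:9090` (the stack above does not publish that port, so the URL would need adjusting), and the helper names `unhealthy_targets` and `fetch_targets` are made up for this example.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (scrape_url, last_error) for every active target that is not 'up'.

    `payload` is the decoded JSON body of GET /api/v1/targets.
    """
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets list; base_url is an assumption, not from the setup above."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

Calling `unhealthy_targets(fetch_targets())` from somewhere that can reach Prometheus should list the local node's scrape URL together with the error Prometheus recorded for it, if the scrape is indeed failing.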
What am I missing here? Why is Prometheus not "seeing" the metrics from node-exporter for the node on which it is running?
Any hint will be much appreciated.
I just had to add a firewall rule on the host running the Prometheus service (only one in my case, no replicas) to allow traffic from 172.18.0.0/16 to port 9100.
Reason: containers of Docker stack services use the docker_gwbridge network to communicate with the host they run on. In my case, port 9100 was open only to the peer swarm nodes, not to docker_gwbridge, which used subnet 172.18.0.0/16.
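To illustrate why the local scrape was dropped while the remote ones worked, here is a minimal sketch of the allow-list logic using Python's standard `ipaddress` module. The peer addresses are hypothetical stand-ins (the real ones are elided above), `172.18.0.3` is just an example of an address a container might get on docker_gwbridge, and `is_allowed` is an illustrative helper, not part of any firewall tool.

```python
import ipaddress


def is_allowed(source_ip, allowed_networks):
    """True if source_ip falls inside any of the allowed networks (CIDR strings)."""
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(net) for net in allowed_networks)


# Hypothetical stand-ins for the two peer nodes allowed by the original rules.
peers_only = ["192.0.2.11/32", "192.0.2.12/32"]
# The same rules plus the docker_gwbridge subnet.
with_gwbridge = peers_only + ["172.18.0.0/16"]

# Hypothetical source address of the Prometheus container on docker_gwbridge.
local_scrape_source = "172.18.0.3"

print(is_allowed(local_scrape_source, peers_only))     # False
print(is_allowed(local_scrape_source, with_gwbridge))  # True
```

The scrape of the local node arrives from the bridge subnet rather than from a peer address, so it only passes once that subnet is allowed.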
Since this is a swarm, the service may end up on a different host when I remove the stack and deploy it again, or at any time if the service replica is restarted. To be safe, I configured docker_gwbridge on all nodes in the swarm as follows:
docker network create \
  --subnet 172.18.0.0/16 \
  --opt com.docker.network.bridge.name=docker_gwbridge \
  --opt com.docker.network.bridge.enable_icc=false \
  --opt com.docker.network.bridge.enable_ip_masquerade=true \
  docker_gwbridge
As described here.
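To check that every node really ended up with the same subnet, something like the following could be run on each of them. It is a rough sketch: it assumes the `docker` CLI is on the PATH and that the current user may run it, and the helper names `gwbridge_subnets` and `inspect_gwbridge` are invented for this example.

```python
import json
import subprocess


def gwbridge_subnets(inspect_output):
    """Extract the IPAM subnets from `docker network inspect` JSON output."""
    networks = json.loads(inspect_output)
    return [
        cfg["Subnet"]
        for net in networks
        # "Config" can be null for some networks, hence the `or []`.
        for cfg in (net.get("IPAM", {}).get("Config") or [])
        if "Subnet" in cfg
    ]


def inspect_gwbridge():
    """Run `docker network inspect docker_gwbridge` on the local node."""
    result = subprocess.run(
        ["docker", "network", "inspect", "docker_gwbridge"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

If `gwbridge_subnets(inspect_gwbridge())` contains `172.18.0.0/16` on all nodes, a single firewall rule for that subnet should cover whichever node the service lands on.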
Probably not the most elegant solution, but the only one I found that works in this case. In any case, I have moved this part of my project to GrafanaLabs, which covers my needs well for this ;-)