
Prometheus in docker swarm not monitoring node-exporter in own node


I have a Docker swarm with three nodes: one manager and two workers. Each node runs node-exporter in a local container (not part of any swarm stack). The compose file used to create this container on each node separately is as follows:

version: '3'

services:
  node_exporter:
    image: quay.io/prometheus/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
    network_mode: host
    pid: host
    restart: unless-stopped
    volumes:
      - '/:/host:ro,rslave'
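
On each node the container is brought up with plain Compose; node-exporter listens on port 9100 by default, so it can be checked locally right away (the compose file name below is just an example):

# start the exporter on this node (file name is just an example)
docker compose -f node-exporter.yml up -d

# quick local check that it answers on the default port 9100
curl -s http://localhost:9100/metrics | head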

Prometheus runs in the swarm using a stack like:

version: '3'

services:
  prometheus:
    image: prom/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
    volumes:
      - /home/ubuntu/monitor/etc/prometheus:/etc/prometheus
      - /home/ubuntu/monitor/prometheus:/prometheus
    networks:
      - monitnw
      
networks:
  monitnw:
    external: true
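
The external overlay network is created once from a manager node, and the stack is then deployed with docker stack deploy (the file and stack names below are just examples):

# create the external overlay network once, on a manager node
docker network create --driver overlay monitnw

# deploy (or update) the stack
docker stack deploy -c prometheus-stack.yml monitor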

In prometheus.yml there is a scrape job that collects the node-exporter metrics from the three nodes:

  - job_name: 'Node-exporter metrics'
    static_configs:
      - targets: ['X.Y.Z.178:9100', 'X.Y.Z.218:9100', 'X.Y.Z.93:9100']
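
The full file can be sanity-checked with promtool, which ships inside the prom/prometheus image, so a plain syntax problem can be ruled out quickly (the path is the one mounted in the stack file above):

docker run --rm \
  -v /home/ubuntu/monitor/etc/prometheus:/etc/prometheus \
  --entrypoint /bin/promtool \
  prom/prometheus check config /etc/prometheus/prometheus.yml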

On all three nodes, the firewall is configured to allow all traffic from the other two peer nodes.

In Grafana, I can see that metrics are reported correctly for the two peers of the node on which Prometheus is running, but there is no data at all for the node that runs Prometheus itself. This is the same regardless of which node runs Prometheus.

What am I missing here? Why is Prometheus not "seeing" the metrics from node-exporter on the very node it is running on?

Any hint will be much appreciated.


Solution

I just had to add a rule to the firewall of the host running the Prometheus service (only one host in my case, since there are no replicas) to allow traffic from 172.18.0.0/16 on port 9100.

Reason: Docker stack services running in containers on a host use the docker_gwbridge network to communicate with the host. In my case, port 9100 was only open to the peer swarm nodes, not to docker_gwbridge, which uses subnet 172.18.0.0/16 on my hosts.
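
The subnet actually in use can be read off the bridge itself, and the rule is then a one-liner (ufw is shown here only as an example; adapt it to whatever firewall you use):

# check which subnet docker_gwbridge uses on this host
docker network inspect docker_gwbridge --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'

# allow scrapes from containers on that bridge to the exporter port (ufw example)
sudo ufw allow from 172.18.0.0/16 to any port 9100 proto tcp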

Obviously, this is a swarm, so when I remove the stack and deploy it again the service may end up on a different host; the same can happen at any time if the service replica is restarted. To be safe, I configured docker_gwbridge on all nodes in the swarm as follows:

docker network create \
  --subnet 172.18.0.0/16 \
  --opt com.docker.network.bridge.name=docker_gwbridge \
  --opt com.docker.network.bridge.enable_icc=false \
  --opt com.docker.network.bridge.enable_ip_masquerade=true \
  docker_gwbridge
    

This follows the procedure described in the Docker documentation on customizing docker_gwbridge.
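
Note that docker_gwbridge already exists on any node that has joined a swarm, so it cannot simply be created again in place. Roughly (see the Docker docs for the exact procedure): take the node out of the swarm or drain it so nothing is attached to the bridge, remove the existing network, recreate it, and rejoin:

# with the node drained / temporarily out of the swarm:
docker network rm docker_gwbridge
# then recreate it with the custom options shown above and rejoin the swarm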

Probably not the most elegant solution, but it is the only one I found that works in this case. In any case, I have since moved this part of my project to Grafana Labs, which covers my needs well ;-)