prometheus consul

How can I keep a node visible to Prometheus via Consul service discovery when that node's agent goes down?


I'm using Consul for service discovery in Prometheus, and it's working well for the most part. I have exporters running on my nodes and Consul agents running on the same nodes, and I've registered the exporter services in the Consul cluster via the agents (using REST calls to the agents). Prometheus correctly finds the registered exporters and scrapes their metrics. It also correctly fires an alert when a registered service (exporter) is taken down.

The problem is that when a node loses its Consul agent (either just the agent process or the whole node goes down), the Consul cluster no longer sees the node at all. Prometheus then doesn't know about the node and doesn't even try to scrape its exporter metrics, so I don't get an alert. In other words, when an agent goes down on a node, the node just disappears and I never find out. I've tried "leave_on_terminate": false in the agent's agent.json config, but that doesn't make a difference.
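
For reference, a setup like the one described would typically use a scrape job along these lines. This is only a minimal sketch: the Consul address, the job name, and the service name `node-exporter` are placeholders I've assumed, not values from my actual config.

```yaml
# prometheus.yml (fragment) - hypothetical values throughout
scrape_configs:
  - job_name: "consul-exporters"            # assumed job name
    consul_sd_configs:
      - server: "consul.example.internal:8500"  # assumed Consul address
        services: ["node-exporter"]         # assumed registered service name
    relabel_configs:
      # Copy the Consul node name onto each target as a "node" label,
      # so alerts can say which node is affected.
      - source_labels: [__meta_consul_node]
        target_label: node
```

Targets exist only for services that Consul currently returns for this job, which is why a node that drops out of the catalog also drops out of Prometheus.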

Yes, I know I can use DNS service records for service discovery as well, which would keep the node visible in Prometheus even when a Consul agent goes down, but then I'd be double-scraping metrics all the rest of the time when the agent is up. I want to stick to only using the Consul paradigm for service discovery, and not mix the DNS service record approach in there as well. I'd also like to avoid monitoring the agents separately (i.e. via blackbox exporter).

Any ideas? Please help. Thanks!


Solution

  • We figured this out on this end. Everything is working now.

    Summary of solution: Setting '"leave_on_terminate": false' in the agent.json config in the agent containers did make the Consul cluster show the node as red when its agent container went down (the original problem). However, Prometheus then silently stopped scraping metrics from that node and wouldn't alert (a new problem with the same effect as the original one). We ended up running the consul-exporter on the nodes as well, to expose metrics about each node's Consul agent. With that in place, Prometheus still didn't alert on its own when we took down a Consul agent, but the consul-exporter metrics showed that the agent was down. We therefore added a Prometheus rule to the Consul part of the rules.yml config to raise an alert when the consul-exporter metrics show the Consul agent is down. This worked end-to-end.
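
    A rule of that kind might look roughly like the sketch below. It assumes the alert keys off the consul-exporter's `consul_up` gauge, which reports whether the exporter's last query of Consul succeeded. The summary above doesn't say which metric was actually used, and the group name, alert name, `for` duration, and severity label here are all illustrative.

    ```yaml
    # rules.yml (fragment) - a sketch, not the exact rule from the summary
    groups:
      - name: consul                      # assumed group name
        rules:
          - alert: ConsulAgentDown        # assumed alert name
            # consul_up is 0 when the consul-exporter could not query
            # the Consul agent it is pointed at.
            expr: consul_up == 0
            for: 1m                       # assumed grace period
            labels:
              severity: critical          # assumed label
            annotations:
              summary: "Consul agent unreachable on {{ $labels.instance }}"
    ```

    This only fires while Prometheus is still scraping the consul-exporter itself, so that exporter's target must not depend on the agent it is watching.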