prometheus, prometheus-alertmanager

Host down alert in Prometheus when all scrape jobs are down


I am having trouble understanding and implementing alert logic in Prometheus. I have two alert rules:

alert: JobDown
expr: up == 0
for: 5m
labels:
  severity: warning
annotations:
  summary: Scrape job {{ $labels.job }} down on {{ $labels.hostname }}.

alert: HostDown
expr: sum(up) == 0
for: 5m
labels:
  severity: critical
annotations:
  description: All scrape jobs down on {{ $labels.hostname }}.
  summary: Host {{ $labels.hostname }} down.

I would expect the HostDown alert to fire when all jobs on a host are down, but that has not been the case: I have seen hosts go down and Prometheus show alerts for every scrape job on them, yet the HostDown alert did not fire. Is my expression written correctly?


Solution

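  • sum without a grouping clause discards every label, hostname included, and adds up the up series of all targets. The result is a single value that is 0 only when every target on every host is down, so one unreachable host never satisfies sum(up) == 0. To sum per hostname, you need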

    sum by (hostname) (up) == 0
    

    NB: hostname is not a standard label on up (the labels Prometheus attaches by default are job and instance); it is a custom label in the original poster's configuration.
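
    Put back into the HostDown rule from the question, the grouped expression might look like the sketch below. The groups wrapper and the group name host-availability are assumptions added to make the fragment a complete rule file; the question shows only the rule bodies.

    ```yaml
    groups:
      - name: host-availability   # hypothetical group name, not from the question
        rules:
          - alert: HostDown
            # One result per hostname value; it is 0 only when every
            # scrape job carrying that hostname label reports up == 0.
            expr: sum by (hostname) (up) == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: Host {{ $labels.hostname }} down.
              description: All scrape jobs down on {{ $labels.hostname }}.
    ```

    Because the aggregation keeps the hostname label, {{ $labels.hostname }} in the annotations resolves to the affected host. With a plain sum(up) the label is dropped and the template would render empty.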