I have a problem understanding and/or implementing the alert logic in Prometheus. I have two alert rules:
alert: JobDown
expr: up == 0
for: 5m
labels:
  severity: warning
annotations:
  summary: Scrape job {{ $labels.job }} down on {{ $labels.hostname }}.

alert: HostDown
expr: sum(up) == 0
for: 5m
labels:
  severity: critical
annotations:
  description: All scrape jobs down on {{ $labels.hostname }}.
  summary: Host {{ $labels.hostname }} down.
I would expect the HostDown alert to fire when all jobs on a host are down, but that has not been the case: I have seen hosts go down, and Prometheus showed an alert for every scrape job but did not fire the HostDown alert. Did I write the expression correctly?
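sum without a by clause ignores hostname and sums over everything, so sum(up) == 0 only holds when every target on every host is down. The aggregated result also carries no hostname label for the annotations to use. To sum per hostname, you need

    sum by (hostname) (up) == 0

NB: hostname is not a standard label on up; it's a custom label in the original poster's configuration.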
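For reference, the HostDown rule from the question with the per-host aggregation applied might look like the sketch below. The surrounding groups wrapper and the group name host-availability are assumptions for illustration (they are not in the original post), and the rule presumes every scrape target carries the custom hostname label.

```yaml
groups:
  - name: host-availability          # hypothetical group name, not from the question
    rules:
      - alert: HostDown
        # Aggregate per hostname: one result series per host, which is 0
        # only when every scrape job on that host reports up == 0.
        expr: sum by (hostname) (up) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          # hostname survives the aggregation because it is listed in "by".
          summary: "Host {{ $labels.hostname }} down."
          description: "All scrape jobs down on {{ $labels.hostname }}."
```

Because hostname is kept by the by clause, the result series still carries that label, so the {{ $labels.hostname }} templates in the annotations resolve to the affected host.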