I'm trying to handle multiple (around 500) metrics with similar names like:
INSTANCE03{INSTANCE03="Dead"} == 1
INSTANCE05{INSTANCE05="Dead"} == 1
INSTANCE07{INSTANCE07="Dead"} == 1
Each of them is exposed as an Enum metric that shows its status like this:
INSTANCE03{INSTANCE03="Dead"} == 1
INSTANCE03{INSTANCE03="Alive"} == 0
Is there a way to write a single alert that fires when the status switches from Alive to Dead for all of those metrics, e.g. by matching the __name__ label with a regex?
Alerting works if I specify one metric per rule, but that is not a clean approach for so many metrics.
Below is my alert_rules.yml:
groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: INSTANCE03{INSTANCE03="Dead",instance="127.0.0.1:8888",job="prometheus"} == 1
    for: 15s
    annotations:
      summary: "Instance is down."
      description: "Instance down for 15 seconds. Please check the mentioned instance."
You could use a labelmap action in metric_relabel_configs to fix up these metric and label names.
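A minimal sketch of what that could look like, assuming all these metrics arrive via one scrape job (the regex INSTANCE[0-9]+ and the names instance_name, status and instance_status are arbitrary choices for this example, not anything Prometheus prescribes):

scrape_configs:
- job_name: prometheus
  static_configs:
  - targets: ['127.0.0.1:8888']
  metric_relabel_configs:
  # Keep the original metric name (INSTANCE03, ...) as a label so it
  # survives the rename below; the default action is "replace" and the
  # default replacement is "$1".
  - source_labels: [__name__]
    regex: '(INSTANCE[0-9]+)'
    target_label: instance_name
  # labelmap matches label *names*: for a label like INSTANCE03="Dead"
  # it copies the value "Dead" to the label name given by replacement,
  # producing status="Dead".
  - action: labelmap
    regex: 'INSTANCE[0-9]+'
    replacement: 'status'
  # Collapse all the per-instance metric names into one common name.
  - source_labels: [__name__]
    regex: 'INSTANCE[0-9]+'
    target_label: __name__
    replacement: 'instance_status'
  # Drop the now-redundant per-instance labels.
  - action: labeldrop
    regex: 'INSTANCE[0-9]+'

With that in place, a single rule covers all ~500 metrics:

groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: instance_status{status="Dead"} == 1
    for: 15s
    annotations:
      summary: "Instance {{ $labels.instance_name }} is down."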
As Alin says, though, fixing the source of the metrics is best. A single gauge with a 0/1 value per instance would be simplest.
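For example (instance_up is a hypothetical metric name, following the usual convention of 1 = alive), the exposition could simply be:

# HELP instance_up Whether the instance is alive (1) or dead (0).
# TYPE instance_up gauge
instance_up{name="INSTANCE03"} 1
instance_up{name="INSTANCE05"} 0

and the whole alert reduces to instance_up == 0.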