We are using Prometheus with Grafana and want to set an alert on the 5-minute CPU load average.
We have 60 servers with different numbers of CPU cores: some machines have 1 core, others have 2, 6, 8, and so on.
The rule below alerts on the 5-minute load, but it does not distinguish between single-core and multi-core machines:
- name: alerting_rules
  rules:
  - alert: LoadAverage15m
    expr: node_load5 >= 0.75
    labels:
      severity: major
    annotations:
      summary: "Instance {{ $labels.instance }} - high load average"
      description: "{{ $labels.instance }} (measured by {{ $labels.job }}) has high load average ({{ $value }}) over 5 minutes."
I also tried the rule below, but it does not work either:
  - alert: LoadAverage5minutes
    expr: node_load5/count(node_cpu{mode="idle"}) without (cpu,mode) >= 0.95
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Load average is high for 5 minutes (instance {{ $labels.instance }})"
      description: "Load is high \n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Can you please tell me what changes my rule needs in order to work?
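The following expression should work. It divides the 5-minute load by the number of CPU cores on each instance, counted from node_cpu_seconds_total (the name node_exporter has used since version 0.16 in place of node_cpu):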
expr: node_load5 / count by (instance, job) (node_cpu_seconds_total{mode="idle"}) >= 0.95
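
For reference, here is a sketch of how that expression might sit inside a complete rule file, reusing the group name, alert name, labels, and annotations from the question. It assumes that node_load5 carries only the instance and job labels; if your series have extra labels, one-to-one matching against the count will find no pairs, and you would probably need `on (instance, job)` after the `/`. The file name in the note below is hypothetical.

```yaml
groups:
  - name: alerting_rules
    rules:
      - alert: LoadAverage5minutes
        # 5-minute load average divided by the number of CPU cores.
        # Each core exposes exactly one mode="idle" series, so counting
        # those series per (instance, job) gives the core count.
        expr: node_load5 / count by (instance, job) (node_cpu_seconds_total{mode="idle"}) >= 0.95
        # Fire only if the condition has held for 5 minutes.
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Load average is high for 5 minutes (instance {{ $labels.instance }})"
          description: "Load is high\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```

If you save this as, say, alerts.yml, you can validate the syntax with `promtool check rules alerts.yml` before reloading Prometheus. Note that `{{ $value }}` will then be the per-core load, not the raw load average.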