We use a Prometheus alert (and node-exporter) to check whether we are running out of memory on a node.
Issue: In many cases I get an alert with a $value that is below the threshold value in the expression.
The alert rule is:
alert: GettingOutOfMemory
expr: max(sum by(instance) ((((node_memory_MemTotal_bytes) - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / (node_memory_MemTotal_bytes)) * 100)) >= 90
for: 5m
labels:
  severity: warning
annotations:
  description: Docker Swarm node {{ $labels.instance }} memory usage is at {{ humanize $value }}%.
  summary: Memory is getting low for Swarm node '{{ $labels.node_name }}'
I get alert messages saying that we are running out of memory at e.g. 83%, so 83 is the $value, which is clearly below the 90% threshold.
Why do I get this alert even though the $value is below the threshold?
How can I repair this Prometheus alert rule so that I only get alerts when the $value is above the threshold?
The problem most likely results from using 'max': the outer max() aggregates over all instances and drops the instance label, so the rule no longer alerts per node.
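For illustration, here is a minimal sketch of the same rule with the outer max() dropped (metric names, threshold and labels are unchanged from the rule above), so that each instance keeps its own series and fires its own alert:

alert: GettingOutOfMemory
# sum by(instance) keeps one series per node; only labels listed in by() (here: instance)
# survive the aggregation, so annotations referencing other labels would render empty
expr: sum by(instance) (((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes) * 100) >= 90
for: 5m
labels:
  severity: warning
annotations:
  description: Docker Swarm node {{ $labels.instance }} memory usage is at {{ humanize $value }}%.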
A better query, which uses node_memory_MemAvailable_bytes directly, is given below:
- alert: high_memory_load
  expr: ((1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100) > 85
  for: 30s
  labels:
    severity: warning
  annotations:
    summary: "Server memory is almost full"
    description: "Docker host memory usage is {{ humanize $value }}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
An alternative, for nodes where node_memory_MemAvailable_bytes is not available, is:
expr: ((1 - ((node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes)) * 100) > 85
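To convince yourself that the rule only fires above the threshold, you can unit-test it with promtool (promtool test rules <file>). Below is a rough sketch assuming the high_memory_load rule above is saved as alerts.yml; the file name memory_test.yml, the instance node1 and the job node are made-up values for illustration:

# memory_test.yml -- hypothetical unit-test file for the high_memory_load rule,
# assuming the rule lives in alerts.yml (both file names are placeholders)
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 100 of 1000 bytes available => (1 - 100/1000) * 100 = 90% used, above the 85% threshold
      - series: 'node_memory_MemTotal_bytes{instance="node1", job="node"}'
        values: '1000 1000 1000 1000 1000 1000'
      - series: 'node_memory_MemAvailable_bytes{instance="node1", job="node"}'
        values: '100 100 100 100 100 100'
    alert_rule_test:
      - eval_time: 5m
        alertname: high_memory_load
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: node1
              job: node

Running promtool test rules memory_test.yml should report this test as passing; changing the input series so usage stays below 85% and expecting an empty exp_alerts list covers the opposite case.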