Tags: prometheus, prometheus-alertmanager

Prometheus alerts with a value that is below the threshold


We use a Prometheus alert (and node-exporter) to check whether we are running out of memory on a node.

Issue: In many cases I get an alert with a $value that is below the threshold value in the expression.

The expression is:

alert: GettingOutOfMemory
expr: max(sum by(instance) (
    (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes))
    / node_memory_MemTotal_bytes * 100
  )) >= 90
for: 5m
labels:
  severity: warning
annotations:
  description: Docker Swarm node {{ $labels.instance }} memory usage is at {{ humanize $value}}%.
  summary: Memory is getting low for Swarm node '{{ $labels.node_name }}'

I get messages saying that we are running out of memory at, for example, 83%; that 83% is the rendered $value, which is clearly below the 90% threshold.

Why do I get this alert even though the $value is below the threshold?

How can I repair this Prometheus alert rule so that I only get alerts when the $value is above the threshold?
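
For reference, the per-instance values behind the rule can be inspected directly via the Prometheus HTTP API; the snippet below evaluates the inner part of the expression (host and port are placeholders for our setup):

    curl -G 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=sum by(instance) ((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100)'

Comparing these per-instance numbers with the single series produced by the max(...) wrapper shows which value the alert is actually tracking.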


Solution

  • The problem results from using 'max': the outer max() aggregation collapses all instances into a single series and drops the per-instance labels, so {{ $labels.instance }} and {{ $labels.node_name }} in the annotations no longer resolve to a specific node. Alerting per instance avoids this (if a single worst-node alert is really wanted, see the topk sketch at the end of this answer).

    A better query is given below:

     - alert: high_memory_load
       expr: ((1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100) > 85
       for: 30s
       labels:
         severity: warning
       annotations:
         summary: "Server memory is almost full"
         description: "Docker host memory usage is {{ humanize $value }}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."

    An alternative that reproduces the original MemFree + Buffers + Cached calculation is:

    expr: ((1 - ((node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes)) * 100) > 85
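
    If a single alert for the worst node is really wanted, rather than one alert per node as in the queries above, topk is a better fit than max: topk(1, ...) returns the highest series together with its labels, whereas max(...) drops them. A minimal sketch, with an illustrative rule name and threshold rather than anything from the original rules:

     - alert: high_memory_load_worst_node
       # topk(1, ...) keeps the labels of the winning series, so
       # {{ $labels.instance }} still resolves in the annotations;
       # a bare max() would strip the instance label.
       expr: topk(1, (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100) > 85
       for: 5m
       labels:
         severity: warning
       annotations:
         summary: "Server memory is almost full"
         description: "Memory usage is {{ humanize $value }}% on {{ $labels.instance }}."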