Tags: prometheus, prometheus-alertmanager

Delay Prometheus alert before changing from active to inactive


I have an alert in my Prometheus setup that fires when someMetric > 100 has held for 5m, and the alert is then resent every 24h, according to the configuration below:

prometheus-alert.yml

  - alert: TestAlert
    expr: someMetric > 100
    for: 5m

alertmanager-config.yml

repeat_interval: 24h
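
For context, the rule snippet sits inside a rule group and repeat_interval is set on the top-level route; the full files look roughly like this (group and receiver names are just placeholders):

prometheus-alert.yml

  groups:
    - name: test-alerts        # placeholder group name
      rules:
        - alert: TestAlert
          expr: someMetric > 100
          for: 5m

alertmanager-config.yml

  route:
    receiver: default          # placeholder receiver name
    repeat_interval: 24h

  receivers:
    - name: default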

However, someMetric can be "stable" above 100 (so the alert is active), but every once in a while it drops below 100 for a single scrape before jumping back above 100. That one dip makes the active alert become inactive (resolved), then pending, and active again after 5 minutes, which causes Prometheus to resend the alert, and that is what I want to avoid.

Is there a way to configure Prometheus to have something similar to for: 5m, but for the transition active -> inactive (resolved)?


Solution

  • You could use one of the aggregation-over-time PromQL functions to 'filter out' the blips that dip below 100. In your case it sounds like max_over_time might work. The only downside is that it could take a few minutes longer for the alert to end once the value drops permanently below 100.

    - alert: TestAlert
      expr: max_over_time(someMetric[2m]) > 100
      for: 5m
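
    If you want to verify that the smoothed expression really rides through a one-sample dip, promtool's rule unit tests can simulate it. Below is a minimal sketch, assuming the adjusted rule is saved in prometheus-alert.yml inside a rule group and the scrape/evaluation interval is 1m; the instance="demo" label is only there so the expected alert labels are explicit.

    test.yml (run with: promtool test rules test.yml)

      rule_files:
        - prometheus-alert.yml

      evaluation_interval: 1m

      tests:
        - interval: 1m
          input_series:
            # above 100 everywhere except a single 90 at minute 5
            - series: 'someMetric{instance="demo"}'
              values: '110 110 110 110 110 90 110 110 110 110 110'
          alert_rule_test:
            # TestAlert should still be firing at minute 10 despite the dip
            - eval_time: 10m
              alertname: TestAlert
              exp_alerts:
                - exp_labels:
                    instance: demo

    The [2m] window just needs to cover at least two scrape intervals so a single low sample cannot pull the max below the threshold; making it longer delays resolution by the same amount.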