prometheus snmp prometheus-alertmanager

Prometheus alert to trigger if status x = 2 and y != 3 in the last minute


Is there a way to create an alert for "If upsBasicOutputStatus != 2 AND (upsBasicOutputStatus != 4 over the last 1 minute)" within Prometheus alerts?

I have a Prometheus instance scraping SNMP data from a range of UPSs. As part of this I have also set up alerting in Prometheus to fire the moment a UPS state changes to "On Battery"; we want the alert to fire the moment it happens rather than wait for another scrape to occur.

upsBasicOutputStatus != 2

Sadly, this has the side effect of alerting when a self-test takes place every two weeks. Adding the exclusion to the expression was simple:

upsBasicOutputStatus != 2 and upsAdvTestDiagnosticsResults != 4

This works some of the time. Sadly, it seems that the "On Battery" status lasts longer than the "Test in Progress" status, so an alert is fired when the test ends but the UPS is still on battery.

UPS Status timeline

I would rather not extend the for: duration, as that would delay an actual alert going out. Although we have a PCNS system in place to shut down racks, in my experience having someone on hand for the critical systems is needed just in case it fails, which has happened.

Full alert rule

    - alert: UPSState
      expr: upsBasicOutputStatus != 2 and upsAdvTestDiagnosticsResults != 4 # Not online and not in self-test
      labels:
        severity: "critical"
      annotations:
        summary: "UPS {{ $labels.instance }} is no longer online"
        description: "UPS has entered the state {{ $value }}"
        dashboard: "d/FBsdas/?orgId=1&refresh=10s&var-datasource={{ $labels.source }}&var-ups={{ $labels.instance }}"

Update

After trying the suggested rule from @markalex, the unless ... last_over_time expression seems to shift the data points later by ~5 seconds, which then triggers the alert.

Raw data versus over time


Solution

  • You can modify your expression in two steps:

    1. Replace "and metric != 4" with "unless metric == 4".
    2. Extend any occurrence of the value 4 for a minute with last_over_time.

    Additionally, since the selector inside last_over_time is not a simple vector selector, you need to use subquery syntax.

    upsBasicOutputStatus != 2 unless last_over_time((upsAdvTestDiagnosticsResults == 4)[1m:])
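
    Dropped into the alert rule from the question, this could look roughly like the sketch below. Only the expr line changes; the labels and annotations are copied from the question (dashboard annotation omitted for brevity, annotation key written as summary). It assumes both metrics carry the same label set for a given UPS, since unless matches on all labels by default, and the 1m window is an assumption to tune against how long the UPS stays on battery after a test ends. Because the subquery [1m:] gives no explicit resolution, it is evaluated at the global evaluation interval.

    ```yaml
    - alert: UPSState
      # Not online, unless a self-test (== 4) was seen at any point in the last minute.
      # The 1m window is an assumption; widen it if the UPS stays on battery longer.
      expr: upsBasicOutputStatus != 2 unless last_over_time((upsAdvTestDiagnosticsResults == 4)[1m:])
      labels:
        severity: "critical"
      annotations:
        summary: "UPS {{ $labels.instance }} is no longer online"
        description: "UPS has entered the state {{ $value }}"
    ```

    With no for: clause, the alert still fires on the first evaluation where the UPS is off line power and no self-test was reported within the window.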