prometheus, promql, prometheus-alertmanager, victoriametrics, metricsql

Metric with changing label


I have a label called managed that the host machine can change between 0 and 1 at any time. I also have an alert that fires when a metric is lagging behind by more than 90s.

However, the alert doesn't account for changes to the managed label, so when the label changes the alert triggers even though the server is fine. I have tried several things but don't see a way forward. What I have at the moment:

(
   min(lag(load.load.shortterm{}[12h:]) keep_metric_names) by (fqdn) > 90s
)
+ on(fqdn) group_left(managed)
(
   0*lag(load.load.shortterm{}[12h:]) keep_metric_names
)

This returns 2 series, one with managed = 1 and one with managed = 0. However, I need only the series with the latest managed label, so I know whether to escalate or not. Does anyone have recommendations on how I can achieve the desired behaviour?


Solution

  • The lag function is calculated independently for each time series returned by the given series_selector. With a dynamic label, only one time series is active at a time. When the label changes from 1 to 0, the time series with managed=1 becomes stale (it no longer receives updates) and the series with managed=0 becomes active. The lag for the first time series then starts to grow, because it gets no more updates. This is what triggers your alert.

    I suggest changing the metric structure from load.load.shortterm{managed="<state>"} to load.load.shortterm.managed{} <state>. With this change there is always exactly one time series, and lag works properly for it.
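To make the per-series behaviour described above concrete, here is a minimal Python sketch. It is a toy model, not VictoriaMetrics code: the helper lag_per_series, the 10s scrape interval, the host name host1 and the label flip at t=100 are all assumptions made up for illustration. It only models "seconds since the last sample of each series" and does not reproduce the min ... by (fqdn) aggregation or the join from the question.

```python
# Toy model of per-series lag (hypothetical helper, not a VictoriaMetrics API).
# A series is identified by its metric name plus its full label set, so a
# changed label value produces a new series and leaves the old one stale.

def lag_per_series(samples, now):
    """Return {series_key: seconds between `now` and that series' last sample}."""
    last_seen = {}
    for series_key, timestamp in samples:
        last_seen[series_key] = max(timestamp, last_seen.get(series_key, timestamp))
    return {key: now - ts for key, ts in last_seen.items()}

NOW = 295        # evaluation time in seconds (assumed)
THRESHOLD = 90   # alert threshold from the question, in seconds

# Current structure: `managed` is a label and flips from 1 to 0 at t=100.
labelled = [
    (("load.load.shortterm", "fqdn=host1",
      "managed=1" if t < 100 else "managed=0"), t)
    for t in range(0, 300, 10)  # one sample every 10s (assumed)
]
labelled_lag = lag_per_series(labelled, NOW)

# Suggested structure: the state is the value of a separate metric, so the
# label set of load.load.shortterm never changes.
restructured = [(("load.load.shortterm", "fqdn=host1"), t)
                for t in range(0, 300, 10)]
restructured_lag = lag_per_series(restructured, NOW)

firing_before = [k for k, v in labelled_lag.items() if v > THRESHOLD]
firing_after = [k for k, v in restructured_lag.items() if v > THRESHOLD]

print(labelled_lag)      # stale managed=1 series has lag 205, managed=0 has lag 5
print(restructured_lag)  # single series with lag 5
print(firing_before, firing_after)
```

In this model the stale managed=1 series crosses the 90s threshold although the host is still reporting, while the restructured metric yields a single series that stays below it.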