How to parameterize the Prometheus rule for a small subset of instances?


I need to create a Prometheus rule that compares the value of some metric to a threshold. The threshold is the same for most instances, but differs for two or three of them. Is there an easy and reliable way to parameterize a rule?

Maybe something like this:

- alert: HighValueAlert
  expr: my_metric > my_metric_threshold
  for: 5m

Here my_metric_threshold is an "artificial" metric that is defined somewhere, e.g. using the Node exporter textfile collector (or perhaps some other method I'm not aware of):

my_metric_threshold{instance="special1"} 101
my_metric_threshold{instance="special2"} 102
my_metric_threshold 100  # default for most instances

Reliability requirements:

  1. If I forget to specify a special threshold for some instance, the default threshold should be applied to it.
  2. If the metric with the default threshold is missing as well (for example, because I accidentally deleted the .prom file), I should get some kind of alert about the incorrect configuration (perhaps via a separate rule).

I'm new to Prometheus and I couldn't find any examples of solving this problem.

I'm setting up indoor temperature monitoring rules. The upper temperature threshold is the same for most rooms, but for 2-3 rooms it needs to be higher; otherwise we will get alerts too frequently. The same applies to the lower temperature threshold.


Solution

  • Your general approach is correct, but here are a couple of suggestions to make your life easier.

    • Give the special thresholds a different metric name from the default one.
    my_metric_threshold_special{instance="server1"} 101
    my_metric_threshold_special{instance="server42"} 102
    my_metric_threshold_default 100
    # If you expose this through the textfile collector too, it will also carry an instance label.
    # That doesn't matter; just make sure it is exposed only once.
    
    • Use separate alert rules for the default case and the special cases.
    - alert: HighValueAlertDefault
      expr: my_metric > on() group_left() my_metric_threshold_default unless on(instance) my_metric_threshold_special
    - alert: HighValueAlertSpecial
      expr: my_metric > on(instance) group_left() my_metric_threshold_special
    

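    Here the default alert rule compares the metric to the default threshold while ignoring all labels (on()), and then drops the series of instances that have a special threshold (unless on(instance)).

    The special alert rule simply compares the metric to the special threshold of the same instance.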
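
    For context, the two rules could sit in a rule file roughly like this. This is only a sketch: the group name and the annotation text are made up for illustration, and for: 5m is carried over from the question.

    ```yaml
    groups:
      - name: my_metric_thresholds        # hypothetical group name
        rules:
          - alert: HighValueAlertDefault
            # on() matches on no labels, so every my_metric series is compared
            # with the single default threshold; "unless" then drops instances
            # that have a special threshold.
            expr: >-
              my_metric > on() group_left() my_metric_threshold_default
              unless on(instance) my_metric_threshold_special
            for: 5m
            annotations:
              summary: "{{ $labels.instance }}: my_metric is {{ $value }} (default threshold)"
          - alert: HighValueAlertSpecial
            # Match each series with the special threshold of the same instance.
            expr: my_metric > on(instance) group_left() my_metric_threshold_special
            for: 5m
            annotations:
              summary: "{{ $labels.instance }}: my_metric is {{ $value }} (special threshold)"
    ```

    Since > binds more tightly than unless in PromQL, the default expression is evaluated as (comparison) unless (special threshold), so no extra parentheses are needed.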


    To check that my_metric_threshold_default exists and is exposed exactly once, you can use the expression

    absent(my_metric_threshold_default) or count(my_metric_threshold_default)>1
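
    Wrapped into a separate rule, as the question suggests, this check might look like the sketch below. The alert name, group name, for duration, severity label and annotation text are all assumptions; adjust them to your setup.

    ```yaml
    groups:
      - name: my_metric_threshold_config   # hypothetical group name
        rules:
          - alert: DefaultThresholdMisconfigured   # hypothetical alert name
            # Fires when the default threshold is missing (absent) or is
            # exposed by more than one target (count > 1).
            expr: absent(my_metric_threshold_default) or count(my_metric_threshold_default) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "my_metric_threshold_default is missing or exposed more than once"
    ```

    Both branches produce a series with no labels, so the alert will not carry an instance label; it only signals that the threshold configuration as a whole needs attention.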