prometheus, prometheus-alertmanager

How to test an alert with for: 1h?


I'm quite new to Prometheus. I have an alert with for: 1h, and I would like to know what eval_time should be set to when testing it. Currently the test fails; it passes only if the alert uses for: 1m and the test uses eval_time: 10m.

Could someone please explain how this works?

example:

groups:
- name: spark.rules
  rules:
  ##### ALERTS #####
  - alert: xxxSparkJobsNotConsuming
    expr: rate(foo{data_type="bar"}[5m]) == 0
    for: 1h  # works if I set it to 5m
    labels:
      service: spark
      severity: warning
      source: spark
    annotations:
      description: 'Nothing have been consumed for 1 hour.'

Tests:

    rule_files:
      - spark.rules.yml
    evaluation_interval: 1m

    tests:
     - interval: 1m
       input_series:
        - series: 'foo{data_type="bar"}'
          values: '0'
        - series: 'foo{data_type="bar"}'
          values: '64706+0x10'
       alert_rule_test:
        - eval_time: 2h
          alertname: xxxSparkJobsNotConsuming
          exp_alerts:
           - exp_labels:
               data_type: xxxx
               service: spark
               severity: warning
               source: spark
             exp_annotations:
               description: 'Nothing have been consumed for 1 hour.'
       promql_expr_test:
        - expr: 'foo'
          eval_time: 4m
          exp_samples:
            - labels: 'foo{data_type="bar"}'
              value: 64706

Solution

  • [You use "xx", "xxx" and "xxxx" for the value of data_type across your message, so I'm not sure which is which. Next time, try "foo", "bar", "baz", or "1", "2", "3". In particular, both of the input_series you define have exactly the same name and label values. I will assume that's not the case in your actual test, and I'll call them series1 and series2.]

    Now, leaving that aside, your test defines 2 time series, with samples 1 minute apart:

    series1: 0
    series2: 64706 64706 64706 64706 64706 64706 64706 64706 64706 64706 64706
    

    series1 has just the one sample, meaning your alert, which uses rate, will never trigger on it, since rate needs at least 2 samples inside its range to produce a result.
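
    The 2-sample requirement can be checked directly with a promql_expr_test. This is only a sketch: it assumes the second series from the question on its own, and it writes the expected labels without a metric name because rate drops it.

    ```yaml
    evaluation_interval: 1m

    tests:
      - interval: 1m
        input_series:
          # 11 samples, one per minute, at minutes 0 through 10.
          - series: 'foo{data_type="bar"}'
            values: '64706+0x10'
        promql_expr_test:
          # At 0m the 5m range holds a single sample, so rate returns nothing.
          - expr: 'rate(foo{data_type="bar"}[5m])'
            eval_time: 0m
            exp_samples: []
          # At 1m the range holds 2 samples of a constant series, so rate is 0.
          - expr: 'rate(foo{data_type="bar"}[5m])'
            eval_time: 1m
            exp_samples:
              - labels: '{data_type="bar"}'
                value: 0
          # At 2h the range holds no samples at all, so rate returns nothing again.
          - expr: 'rate(foo{data_type="bar"}[5m])'
            eval_time: 2h
            exp_samples: []
    ```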

    series2 has a constant value of 64706 for 11 samples (minutes 0 through 10), meaning the rate(series2[5m]) == 0 expression in your alert will hold from minute 1 (the first time you have 2 samples in your 5m range) until about minute 14 (the last time the 5m range can still contain 2 samples, namely the ones at minutes 9 and 10; the exact boundary depends on how your Prometheus version treats the edges of the range). So the condition holds for roughly 13 consecutive minutes.

    In other words, any value of for: X up to about 13m will result in your alert firing at one time or another. Any value larger than that will result in the alert never firing, because the condition never holds for that long. (That being said, the alert_rule_test has eval_time: 2h, which means "these alerts must be firing at 2h from the start", and with these input series that should never be the case, regardless of what value you use in for:.)
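
    As a sketch of that last point, the following test keeps the rule from the question (for: 1h) and the 10-minute input, and asserts that nothing is firing. It assumes the rule file is named spark.rules.yml as in the question. An empty exp_alerts list is how promtool expresses "this alert must not be firing at this time".

    ```yaml
    rule_files:
      - spark.rules.yml
    evaluation_interval: 1m

    tests:
      - interval: 1m
        input_series:
          - series: 'foo{data_type="bar"}'
            values: '64706+0x10'
        alert_rule_test:
          # At 10m the condition has held for only a few minutes, far less than for: 1h.
          - eval_time: 10m
            alertname: xxxSparkJobsNotConsuming
            exp_alerts: []
          # At 2h there have been no samples for a long time, so rate returns
          # nothing and the alert cannot be firing.
          - eval_time: 2h
            alertname: xxxSparkJobsNotConsuming
            exp_alerts: []
    ```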

    Anyway, the point is that you need series2 to cover at least as much time as for: in order for that condition to hold. If you have 10 minutes' worth of samples, the condition cannot hold for 2 hours (unless, of course, the condition is that there are no recent samples).

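    As for eval_time: 2h, as far as I can tell it is simply the time, measured from the start of the test, at which promtool checks which alerts are firing. So it has to fall after for: has elapsed and while the condition still holds, which in turn means your input series must extend at least that far. Beyond that, you'll have to play around with it.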
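
    Putting this together, here is a minimal sketch of a test that should pass with for: 1h left unchanged. It assumes a single input series with data_type="bar" and simply extends it to cover more than 2 hours; the sample count of 130 is an arbitrary choice, picked only so that samples still exist at eval_time.

    ```yaml
    rule_files:
      - spark.rules.yml
    evaluation_interval: 1m

    tests:
      - interval: 1m
        input_series:
          # 131 samples, one per minute, covering minutes 0 through 130.
          - series: 'foo{data_type="bar"}'
            values: '64706+0x130'
        alert_rule_test:
          # At 30m the condition holds but for: 1h has not elapsed yet,
          # so the alert is only pending and nothing is firing.
          - eval_time: 30m
            alertname: xxxSparkJobsNotConsuming
            exp_alerts: []
          # At 2h the condition has held continuously since minute 1,
          # which is longer than for: 1h, so the alert is firing.
          - eval_time: 2h
            alertname: xxxSparkJobsNotConsuming
            exp_alerts:
              - exp_labels:
                  data_type: bar
                  service: spark
                  severity: warning
                  source: spark
                exp_annotations:
                  description: 'Nothing have been consumed for 1 hour.'
    ```

    The exp_labels list the labels from the rule plus the labels of the matching series (data_type here); the description is copied verbatim from the rule, since the annotation has to match exactly.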