I'm quite new to Prometheus. I have an alert with for: 1h.
What should eval_time be set to when testing this alert?
Currently the test fails; it passes only if the alert uses for: 1m and eval_time: 10m.
Could someone please explain how this works?
Example:
groups:
  - name: spark.rules
    rules:
      ##### ALERTS #####
      - alert: xxxSparkJobsNotConsuming
        expr: rate(foo{data_type="bar"}[5m]) == 0
        for: 1h  # works if I set it to 5m
        labels:
          service: spark
          severity: warning
          source: spark
        annotations:
          description: 'Nothing have been consumed for 1 hour.'
Tests:
rule_files:
  - spark.rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'foo{data_type="bar"}'
        values: '0'
      - series: 'foo{data_type="bar"}'
        values: '64706+0x10'
    alert_rule_test:
      - eval_time: 2h
        alertname: xxxSparkJobsNotConsuming
        exp_alerts:
          - exp_labels:
              data_type: xxxx
              service: spark
              severity: warning
              source: spark
            exp_annotations:
              description: 'Nothing have been consumed for 1 hour.'
    promql_expr_test:
      - expr: 'foo'
        eval_time: 4m
        exp_samples:
          - labels: 'foo{data_type="bar"}'
            value: 64706
[You use "xx", "xxx" and "xxxx" for the value of data_type across your message, so I'm not sure what's what. Next time, maybe use "foo", "bar", "baz", or "1", "2", "3". In particular, both of the input_series you define have exactly the same name and label values. I will assume that's not the case in your actual test, and I'll call them series1 and series2.]
Now, leaving that aside, your test defines 2 time series, with samples 1 minute apart:
series1: 0
series2: 64706 64706 64706 64706 64706 64706 64706 64706 64706 64706 64706
series1 has just the one sample, meaning your alert, which uses rate, will never trigger on it, since rate needs at least 2 samples to produce a result.
series2 has a constant value of 64706 across 11 samples (minutes 0 through 10), meaning the rate(series2[5m]) == 0 expression in your alert will hold from minute 1 (the first time you have 2 samples in your 5m range) until about minute 14 (the last time the 5m range still contains 2 samples; whether a sample exactly 5m old still counts depends on the Prometheus version, so the boundary may be a minute earlier). Meaning that the condition holds for roughly 13 minutes.
In other words, any value of for: X up to about 13m will result in your alert firing at one time or another. Any larger value will result in the alert never firing, because the condition never holds for that long. (That being said, the alert_rule_test has eval_time: 2h, which I take to mean "the alert must be firing 2h from the start", and that should never be the case here, regardless of what value you use in for:, because the samples stop at minute 10.)
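To see this timeline for yourself, a promql_expr_test along these lines might help. This is only a sketch: the data_type="baz" label on series1 is made up so that the two series are distinct, and it assumes your rule file is still called spark.rules.yml. The eval times are chosen well away from the window boundaries so the result should not depend on the Prometheus version.

```yaml
rule_files:
  - spark.rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # series1: a single sample (the "baz" label value is hypothetical)
      - series: 'foo{data_type="baz"}'
        values: '0'
      # series2: 11 samples, minutes 0 through 10
      - series: 'foo{data_type="bar"}'
        values: '64706+0x10'
    promql_expr_test:
      # Mid-series: only series2 has 2 or more samples in the 5m range,
      # so only series2 yields a rate (and that rate is 0).
      - expr: 'rate(foo[5m]) == 0'
        eval_time: 5m
        exp_samples:
          - labels: '{data_type="bar"}'
            value: 0
      # Long after the last sample: the 5m range is empty, so the
      # expression returns nothing. As far as I understand, an empty
      # exp_samples list expects exactly that.
      - expr: 'rate(foo[5m]) == 0'
        eval_time: 2h
        exp_samples: []
```

Note that rate drops the metric name, which is why the expected labels are just {data_type="bar"}.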
Anyway, the point is that series2 needs to cover at least as much time as for: in order for the condition to hold that long. If you have 10 minutes' worth of samples, the condition cannot hold for 2 hours (unless, of course, the condition is that there are no recent samples).
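As a rough sketch of what that could look like for for: 1h, here is a test whose input covers the full 2 hours. It assumes a single series with data_type="bar" (so the expected data_type label is bar rather than xxxx) and the rule exactly as posted in the question; adjust it to your real labels.

```yaml
rule_files:
  - spark.rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 121 samples, minutes 0 through 120, so there is data for the
      # whole for: 1h period and still at the 2h evaluation time.
      - series: 'foo{data_type="bar"}'
        values: '64706+0x120'
    alert_rule_test:
      # The condition has held for less than 1h here, so the alert is
      # only pending; an empty exp_alerts means "nothing is firing".
      - eval_time: 30m
        alertname: xxxSparkJobsNotConsuming
        exp_alerts: []
      # By 2h the condition has held for well over 1h, so the alert
      # should be firing with the labels from the series and the rule.
      - eval_time: 2h
        alertname: xxxSparkJobsNotConsuming
        exp_alerts:
          - exp_labels:
              data_type: bar
              service: spark
              severity: warning
              source: spark
            exp_annotations:
              description: 'Nothing have been consumed for 1 hour.'
```

The idea is simply that the number of samples (the 120 in '64706+0x120') times the interval has to reach both the for: duration and the eval_time you are checking.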
I don't know what else to say about eval_time: 2h; I guess you'll just have to play around with it and see why it doesn't appear to do what it says on the tin.