Here I have a requirement.
If the response time of a URL keeps going up on a short-time-period average, meaning its rate of increase is larger than 0, I need to raise an alert warning that this URL may be facing potential danger, and then apply some degradation or flow-limitation measures.
While testing, I have two metrics named `responseTime_POST__prometheus_webhook__test_prometheus_qps_stable` and `responseTime_POST__prometheus_webhook__test_prometheus_qps_increasing`. For readability I will call them `stableUrl` and `increaseUrl` below. Both `increaseUrl` and `stableUrl` are set as `Gauge`s that record the response time of the two testing URLs.
Each time I request `increaseUrl`'s API, the reported response time is set to increase by a step of 1 millisecond, to pretend this API cannot handle such a quantity of requests. Each time I request `stableUrl`'s API, there are no other settings; it just returns success immediately.
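The two endpoints' behaviour can be sketched as below. This is a hypothetical Python simulation of what is described above (the names and the 10 ms baseline are my assumptions), not the actual service code:

```python
# Hypothetical simulation of the two test endpoints' reported response times.
# stableUrl: returns immediately, so its response time stays flat.
# increaseUrl: every request adds a 1 ms step, pretending the API is overloaded.

BASE_MS = 10  # assumed baseline response time in milliseconds


def stable_response_time():
    """stableUrl: constant response time."""
    return BASE_MS


class IncreasingResponder:
    """increaseUrl: response time grows by 1 ms on every request."""

    def __init__(self, base_ms=BASE_MS):
        self.ms = base_ms

    def respond(self):
        self.ms += 1
        return self.ms
```

Scraping these two gauges over time produces exactly the flat and steadily rising series the rest of this question works with.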
It is OK to query the derivative of the response time of both `increaseUrl` and `stableUrl` with PromQL:
deriv(responseTime_POST__prometheus_webhook__test_prometheus_qps_stable[10s])
deriv(responseTime_POST__prometheus_webhook__test_prometheus_qps_increasing[10s])
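For intuition: `deriv()` fits a least-squares regression line through the samples in the range and returns its slope per second. A minimal Python sketch of that estimator (a simplification; Prometheus's implementation also offsets timestamps for numerical stability):

```python
def deriv(samples):
    """Least-squares slope (per second) of a list of (timestamp, value)
    pairs, mimicking PromQL's deriv() over a range vector."""
    n = len(samples)
    t0 = samples[0][0]  # offset timestamps so the sums stay small
    sum_t = sum(t - t0 for t, _ in samples)
    sum_v = sum(v for _, v in samples)
    sum_tv = sum((t - t0) * v for t, v in samples)
    sum_t2 = sum((t - t0) ** 2 for t, _ in samples)
    return (n * sum_tv - sum_t * sum_v) / (n * sum_t2 - sum_t ** 2)
```

A flat series yields a slope of 0, while a series gaining 1 unit every 5 s scrape yields 0.2, which matches the thresholds discussed below.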
The graph is also quite reasonable: `stableUrl` vibrates a little around 0, while `increaseUrl` always has a value larger than 0.5, which means its response time is going up.
In this picture, the top is `stableUrl` and the bottom is `increaseUrl`. Using a `> 0` comparison, you can see that the graph of `increaseUrl` is always larger than 0, while `stableUrl` never stays above it.
So I think: if I set an alert that checks the derivative of the response time and requires the result to be `> 0` for a while, I should get a correct alert on `increaseUrl` and can do some processing afterwards.
The alert works if I define it like this:
groups:
- name: automatic_flow_control
  rules:
  - alert: automatic_flow_control
    expr: deriv(responseTime_POST__prometheus_webhook__test_prometheus_qps_stable[15s])>0.1
    for: 1m
    labels:
      severity: warn
    annotations:
      summary: gateway limitation alert has invoked
      value: "value= {{ $value }}"
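The `for: 1m` clause means the rule sits in the `pending` state until the expression has been true on every evaluation for a full minute, and only then fires; a single false evaluation resets the timer. A toy model of that behaviour (reducing time to evaluation-cycle granularity is my simplification):

```python
def fires(eval_results, hold_evals):
    """Return True once the expression has been true for `hold_evals`
    consecutive evaluation cycles, mimicking Prometheus's `for:` hold.
    Any single false evaluation resets the pending streak to zero."""
    streak = 0
    for ok in eval_results:
        streak = streak + 1 if ok else 0
        if streak >= hold_evals:
            return True
    return False
```

This is why gaps in the series matter later in the question: even brief dips below the threshold restart the 1-minute hold.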
But actually, I have a bunch of URLs and have to manage them all with this alert. So I decided to use the expression
deriv(label_replace({__name__=~"responseTime.*"}, "metricname", "$1", "__name__", "(.*)")[20s:1s])>0.2
to do the job. As I understand it, this query finds all metrics whose names start with `responseTime`, computes a simple derivative over a 4-point window (my Prometheus scrape interval is 5 s, so a 20 s window contains 4 points), and re-evaluates at a 1 s subquery step.
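My reading of the `[20s:1s]` subquery can be sketched like this: the inner expression is re-evaluated at 1 s steps over the last 20 s, with each step taking the most recent scraped sample at or before that instant (Prometheus's lookback). A toy Python model, ignoring the staleness limit:

```python
def subquery_samples(scrapes, window, step, now):
    """Resample a scraped series at `step` resolution over the last
    `window` seconds, carrying forward the most recent sample at or
    before each step, like a PromQL subquery does."""
    out = []
    t = now - window
    while t <= now:
        past = [(ts, v) for ts, v in scrapes if ts <= t]
        if past:
            out.append((t, past[-1][1]))  # last sample at or before t
        t += step
    return out
```

With a 5 s scrape interval, the 4 raw points in a 20 s window are stretched into ~21 step-aligned samples before `deriv()` runs over them.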
This query draws a graph like this:
The green line is `increaseUrl`, and it is greater than 0 almost all the time while testing.
Here is the testing alert rule:
groups:
- name: automatic_flow_control
  rules:
  - alert: automatic_flow_control
    expr: deriv(label_replace({__name__=~"responseTime_.*"}, "metricname", "$1", "__name__", "(.*)")[20s:1s])>0.2
    for: 1m
    labels:
      severity: warn
    annotations:
      summary: gateway limitation alert has invoked
      value: "value= {{ $value }}"
My requirement for this rule is to fire an alert if any metric's value calculated by the expr stays greater than 0.2 for 1 minute.
When I reload prometheus.yml and enable the rule, the alert always stays pending and never fires, even though `increaseUrl` is always greater than 0.2.
Although a few points are missing where `increaseUrl`'s value dips below 0.2, I'm sure that `increaseUrl` has many intervals longer than 1 minute during which the value stays above 0.2, which should definitely fire the alert.
If I set the threshold to 0.1 there are more and larger such intervals in the graph, but still no alerts are fired by Prometheus at all:
I don't know why this happens. Is there a bug I've missed that prevents any alert from firing?
I have found where the problem is.
Referring to this question: `{{ $value }}` should not be put in the alert rule. After removing that line, the problem no longer appears.
The new alert rule shall be:
groups:
- name: automatic_flow_control
  rules:
  - alert: automatic_flow_control
    expr: deriv(label_replace({__name__=~"responseTime_.*"}, "metricname", "$1", "__name__", "(.*)")[20s:1s])>0.2
    for: 1m
    labels:
      severity: warn
    annotations:
      summary: gateway limitation alert has invoked
      # remove this line below
      # value: "value= {{ $value }}"