Tags: prometheus, promql, prometheus-alertmanager

How to write a proper PromQL query to detect an increasing rate in Prometheus


Here I have a requirement.

If the response time of a URL keeps going up in its short-term average, meaning its rate of increase is greater than 0, I need to raise an alert warning that this URL may be facing potential danger, and then apply some degradation or flow-limitation measures.

While testing, I have two metrics named responseTime_POST__prometheus_webhook__test_prometheus_qps_stable and responseTime_POST__prometheus_webhook__test_prometheus_qps_increasing. For readability I will call them stableUrl and increaseUrl below.

Both increaseUrl and stableUrl are Gauges that expose the response times of the two test URLs.

Each time I request increaseUrl's API, the reported response time is increased by 1 millisecond, to simulate an API that cannot handle this volume of requests.

Each time I request stableUrl's API, nothing special happens; it just returns success immediately.
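
For context, gauges like these could be exposed with the Python prometheus_client roughly as in the sketch below. The metric names match the ones above; the port, the update loop, and the concrete values are placeholders, not the real test setup.

# Illustrative sketch only: exposing the two test gauges with prometheus_client.
import time
from prometheus_client import Gauge, start_http_server

stable_url = Gauge(
    "responseTime_POST__prometheus_webhook__test_prometheus_qps_stable",
    "Response time (ms) of the stable test URL",
)
increase_url = Gauge(
    "responseTime_POST__prometheus_webhook__test_prometheus_qps_increasing",
    "Response time (ms) of the degrading test URL",
)

if __name__ == "__main__":
    start_http_server(8000)      # assumed scrape target port
    simulated_ms = 0
    while True:
        stable_url.set(1)        # stableUrl: returns immediately, flat response time
        simulated_ms += 1        # increaseUrl: +1 ms per "request" to simulate overload
        increase_url.set(simulated_ms)
        time.sleep(1)            # stands in for each incoming request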

It is no problem to query the derivative of the response time of both stableUrl and increaseUrl with PromQL:

deriv(responseTime_POST__prometheus_webhook__test_prometheus_qps_stable[10s])
deriv(responseTime_POST__prometheus_webhook__test_prometheus_qps_increasing[10s])

The graphs are also quite reasonable: stableUrl vibrates slightly around 0, while increaseUrl always stays above 0.5, which means its response time keeps going up.

In that picture, the top panel is stableUrl and the bottom is increaseUrl. With a > 0 comparison you can see that increaseUrl always stays above 0, while stableUrl does not.

So I think that if I set up an alert on the derivative of the response time and require the result to stay > 0 for a while, I can get a correct alert for increaseUrl and take action afterwards.

It works fine if I use an alert like this:

groups:
- name: automatic_flow_control
  rules:
  - alert: automatic_flow_control
    expr: deriv(responseTime_POST__prometheus_webhook__test_prometheus_qps_stable[15s])>0.1
    for: 1m
    labels:
      severity: warn
    annotations:
      summary: gateway limitation alert has been invoked
      value: "value= {{ $value }}"

But in reality I have a bunch of URLs and have to cover them all with this alert, so I decided to use the expression deriv(label_replace({__name__=~"responseTime.*"}, "metricname", "$1", "__name__", "(.*)")[20s:1s])>0.2 to do the job.

As I understand it, this query finds all metrics whose names start with responseTime and computes a simple derivative over a 20-second window, evaluated at a 1-second step (my Prometheus scrape interval is 5 s, so 20 s contains 4 raw data points).
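
For reference, deriv() fits a simple least-squares line through the samples in the window and returns its per-second slope; with the [20s:1s] subquery the inner expression is re-evaluated every second, so the regression effectively runs over roughly 20 staircase points built from those 4 raw scrapes. A rough Python sketch of that slope calculation, with made-up sample values:

# Rough sketch of the least-squares slope that deriv() computes.
# The sample values below are made up for illustration.
def least_squares_slope(samples):
    """samples: list of (timestamp_seconds, value) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var  # per-second rate of change

# Four scrapes 5 s apart, response time climbing steadily:
scrapes = [(0, 100.0), (5, 104.0), (10, 109.0), (15, 113.0)]
print(least_squares_slope(scrapes))  # ~0.88 ms/s, well above the 0.2 threshold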

This query draws a graph like the following:

The green line is increaseUrl, and it stays above 0 almost all the time during the test.

Here is the testing alert rule:

groups:
- name: automatic_flow_control
  rules:
  - alert: automatic_flow_control
    expr: deriv(label_replace({__name__=~"responseTime_.*"}, "metricname", "$1", "__name__", "(.*)")[20s:1s])>0.2
    for: 1m
    labels:
      severity: warn
      annotations:
      summary: gateway limitation alert has been invoked
      value: "value= {{ $value }}"

My requirement for this rule is to fire an alert whenever the value the expr calculates for any metric stays greater than 0.2 for 1 minute.

When I reload prometheus.yml and enable the rule, it stays pending forever and never fires an alert, even though increaseUrl's derivative is above 0.2 almost the whole time.

Although a few points are missing where increaseUrl's value drops below 0.2, I'm sure increaseUrl has plenty of intervals longer than 1 minute during which the value stays above 0.2, which should definitely trigger the alert.

If I set the threshold to 0.1, there are even more and longer such intervals in the graph, but Prometheus still does not fire any alert at all.
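
One way to see what Prometheus is actually doing with the rule is to query the built-in ALERTS series over the HTTP API: a pending alert shows up there together with every label attached to it. A small debugging sketch, assuming Prometheus is reachable at localhost:9090:

# Debugging sketch: inspect the pending alert through the Prometheus HTTP API.
# Assumes Prometheus is reachable at localhost:9090.
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": 'ALERTS{alertname="automatic_flow_control"}'},
)
for series in resp.json()["data"]["result"]:
    # alertstate is "pending" or "firing"; the metric dict shows the full
    # label set Prometheus attached to this alert instance.
    print(series["metric"])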

I don't know why this happens. Is there a bug I have missed that keeps me from getting any alert?


Solution

  • I have found where the problem is.

    Referring to this question:

    {{ $value }} should not be put in the alert rule like this; just remove that line and the problem no longer shows up. As far as I can tell, the reason is that in my rule the value: line ends up under labels:, so the templated value becomes a label, and a label whose value changes on every evaluation creates a new alert instance each time, resetting the for timer, so the alert stays pending forever.

    The new alert rule is:

    groups:
    - name: automatic_flow_control
      rules:
      - alert: automatic_flow_control
        expr: deriv(label_replace({__name__=~"responseTime_.*"}, "metricname", "$1", "__name__", "(.*)")[20s:1s])>0.2
        for: 1m
        labels:
          severity: warn
          annotations:
          summary: gateway limitation alert has been invoked
          # remove the line below
          #value: "value= {{ $value }}"