prometheus alert promql

Prometheus Error Rate alert : interval range question


I am very new to Prometheus and have the following alert, which is meant to fire when errors make up more than 5% of the total number of requests:

sum(increase(errorMetric{service_name="someservice"}[5m])) /  sum(increase(http_requests_count{service_name="someservice", path="/some/path"}[5m])) > 0.05

I have an overall idea of the traffic: it is around 100 requests per hour over a 24h interval. How valuable is it to have the interval set to 5m? Should the range cover a longer period, e.g. 1h? This alert goes off without really informing us of a problem. What is your view?

Thank you


Solution

  • Buried in the extensive Prometheus docs, there is a paragraph on the increase function:

    increase should only be used with counters and native histograms where the components behave like counters. It is syntactic sugar for rate(v) multiplied by the number of seconds under the specified time range window, and should be used primarily for human readability.
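
    As a rough illustration of that paragraph, the two expressions below should give the same value for a 5m window. The metric and label are taken from the question; the factor 300 is simply 5m expressed in seconds.

    ```promql
    # Increase of the error counter over the last 5 minutes
    increase(errorMetric{service_name="someservice"}[5m])

    # Same quantity written as a per-second rate times the window length (5m = 300s)
    rate(errorMetric{service_name="someservice"}[5m]) * 300
    ```

    When both sides of a ratio use the same window, this factor cancels, so the window length matters more for the alert than the choice of function.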

    So, to answer your questions:

    1. Is there a strong reason why I should use rate as opposed to increase?

      Yes, use the rate function.

    2. How valuable is it to have the interval set to 5m?

      Not very valuable. Your request rate is very low, fewer than 10 requests per 5m, so some 5m windows will contain few or no requests and others many more. With fewer than 10 requests in a window, a single error already exceeds the 5% threshold. The alert rule will be too sensitive, or simply wrong when viewed over a wider time range. A 30m or 1h range might be better.
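
    Putting both answers together, the alert expression could be sketched as follows. The selectors are copied unchanged from the question; the 1h window is only one of the ranges suggested above, so treat it as a starting point to tune rather than a definitive value.

    ```promql
    # Error ratio over the last hour, using rate() instead of increase()
      sum(rate(errorMetric{service_name="someservice"}[1h]))
    /
      sum(rate(http_requests_count{service_name="someservice", path="/some/path"}[1h]))
    > 0.05
    ```

    At roughly 100 requests per hour, a 1h window needs more than about 5 errors before the ratio crosses 5%, whereas a 5m window with around 8 requests crosses it on the first error.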

    By the way, the time series on each side of the division operator must have matching labels for the alert rule to work.
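
    A minimal sketch of what matching labels looks like in practice: a plain sum() with no grouping clause drops every label, so both sides reduce to a single unlabeled series and the division matches. If you group instead, use the same grouping on both sides. Grouping by service_name here is an assumption for illustration, as is the 1h window.

    ```promql
    # Both operands keep only service_name, so the label sets on each side match
      sum by (service_name) (rate(errorMetric{service_name="someservice"}[1h]))
    /
      sum by (service_name) (rate(http_requests_count{service_name="someservice", path="/some/path"}[1h]))
    > 0.05
    ```

    If the two sides ended up with different label sets, the division would return no result and the alert would never fire.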