I am very new to Prometheus and have the following alert, whose goal is to fire when the number of errors in the total number of requests is higher than 5%:

```
sum(increase(errorMetric{service_name="someservice"}[5m]))
  /
sum(increase(http_requests_count{service_name="someservice", path="/some/path"}[5m]))
  > 0.05
```
I have an overall idea of the traffic: it hovers around 100 requests per hour over a 24h interval. How valuable is it to have the range set to 5m? Should it cover a longer period of time, e.g. 1h? This alert goes off without really informing us of a problem. What is your view?
Thank you
Buried in the mass of Prometheus docs, there is a paragraph for the `increase` function:

> `increase` should only be used with counters and native histograms where the components behave like counters. It is syntactic sugar for `rate(v)` multiplied by the number of seconds under the specified time range window, and should be used primarily for human readability.
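To illustrate that sentence from the docs with the request counter from your query (a 5m window is 300 seconds):

```
# increase() is rate() multiplied by the window length in seconds,
# so for a 5m (300 s) window these two expressions return the same value:
increase(http_requests_count{service_name="someservice"}[5m])
rate(http_requests_count{service_name="someservice"}[5m]) * 300
```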
So to answer your questions:

> Is there a strong reason why I should use `rate` as opposed to `increase`?

Yes, use the `rate` function.
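For example, here is your rule rewritten with `rate`, keeping the same window and threshold. Since both sides use the same window, the seconds factor mentioned in the docs cancels out, so the ratio itself is unchanged:

```
sum(rate(errorMetric{service_name="someservice"}[5m]))
  /
sum(rate(http_requests_count{service_name="someservice", path="/some/path"}[5m]))
  > 0.05
```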
> How valuable is it to have the interval set to 5m?

Not so valuable. Since your RPS/QPS is very small (fewer than 10 requests per 5m), some 5m windows will contain little to no traffic while others contain much more. The alert rule will then be too sensitive, or simply wrong when viewed over a wider time range. A 30m or 1h range would be better.
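To put numbers on it: at roughly 100 requests per hour, a 5m window sees only about 8 requests, so a single error already pushes the ratio to ~12.5% and fires the alert. A sketch of the same rule over a 1h window (threshold unchanged, metric names taken from your query):

```
sum(rate(errorMetric{service_name="someservice"}[1h]))
  /
sum(rate(http_requests_count{service_name="someservice", path="/some/path"}[1h]))
  > 0.05
```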
By the way, the time series on each side of the division operator must have matching labels for the alert rule to work.
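For instance, if you aggregate by a shared label rather than summing everything away, both sides keep the same label set and the one-to-one matching for `/` works. This is a sketch, assuming both metrics carry a `service_name` label:

```
sum by (service_name) (rate(errorMetric{service_name="someservice"}[1h]))
  /
sum by (service_name) (rate(http_requests_count{service_name="someservice", path="/some/path"}[1h]))
  > 0.05
```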