monitoring, metrics, prometheus

Prometheus increase not handling process restarts


I am trying to understand how Prometheus' increase() query function behaves across process restarts.

When there is a process restart within a 2m interval and I query:

sum(increase(my_metric_total[2m])) 

I get a value less than expected.

For example, in a simple experiment I mock:

  • 3 lcm_restarts
  • 1 process restart
  • 2 lcm_restarts

All within a 2 minute interval.

Upon querying:

sum(increase(lcm_restarts[2m])) 

I receive a value of ~4.5 when I am expecting 5.

lcm_restarts graph

sum(increase(lcm_restarts[2m])) result

Could someone please explain?


Solution

  • Pretty concise and well-prepared first question here. Please keep this spirit!

    When working with counters, functions such as rate(), irate() and increase() adjust for counter resets caused by restarts. Contrary to what the name suggests, increase() does not return the exact absolute increase over the given time frame; it is another way of writing rate(metric[interval]) * number_of_seconds_in_interval. The rate() function takes the first and the last sample inside the window, computes the per-second increase between them, and extrapolates that to the full window. Because the samples almost never fall exactly on the start and end of the interval, you can observe non-integer increases even if the counter only ever grows in whole numbers.
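To make the extrapolation concrete, here is a minimal Python sketch of how a reset-corrected, extrapolated increase could be computed for a single time series. It is loosely modeled on the behavior described above and is not Prometheus' actual implementation, whose details may differ between versions. The function name `extrapolated_increase`, the 15-second scrape interval and the sample values are all made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def extrapolated_increase(samples: List[Sample], start: float, end: float) -> float:
    """Approximate increase() for one series over the window [start, end].

    Simplified sketch; `samples` must be sorted by timestamp and lie
    inside the window.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")

    first_t, first_v = samples[0]
    last_t, last_v = samples[-1]
    sampled = last_t - first_t
    if sampled <= 0:
        raise ValueError("samples must span a positive time range")

    # Raw difference, corrected for counter resets: whenever the value drops,
    # the counter is assumed to have restarted from zero, so the value seen
    # just before the drop is added back.
    raw = last_v - first_v
    prev = first_v
    for _, value in samples[1:]:
        if value < prev:
            raw += prev
        prev = value

    avg_step = sampled / (len(samples) - 1)
    to_start = first_t - start
    to_end = end - last_t

    # A counter cannot be negative, so never extrapolate further back than
    # the point where it would have been zero.
    if raw > 0 and first_v >= 0:
        to_start = min(to_start, sampled * (first_v / raw))

    # Extend to a window edge if the nearest sample is close to it;
    # otherwise extend by only half an average step.
    threshold = avg_step * 1.1
    covered = sampled
    covered += to_start if to_start < threshold else avg_step / 2
    covered += to_end if to_end < threshold else avg_step / 2

    return raw * covered / sampled


# Hypothetical 2m window [0, 120] scraped every 15s: the counter goes
# 10 -> 13, the process restarts (value drops to 0), then it goes 0 -> 2.
samples = [(5, 10), (20, 11), (35, 12), (50, 13), (65, 0), (80, 1), (95, 2), (110, 2)]
print(round(extrapolated_increase(samples, start=0, end=120), 3))  # 5.714
```

The reset-corrected raw increase here is exactly 5, but the samples only cover 105 of the 120 seconds, so the result is scaled by 120 / 105 and is no longer an integer.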

    For more details, have a look at the Prometheus docs for the increase() function. The Robust Perception blog also has some good hints on what to do and what not to do when working with counters.

    Looking at your label dimensions, I also think that counter resets don't apply to your constructed example. The label called reason changed between the restarts, which created a second time series rather than continuing the existing one. So you are summing the increases of two different time series, each of which is extrapolated on its own.

    So there isn't really anything wrong with what you are doing; you just shouldn't rely on getting highly precise numbers out of Prometheus for your use case.