
Prometheus: compare the last gauge value to the value 24h ago


I have a Gauge metric in Prometheus that shows the count of not-yet-processed records. The count goes up and down because there are several jobs that process the records.

In Grafana, I show it in a "Time series" widget as avg(not_processed_records_total). It looks great, but a common pattern is that the count spikes and then goes gradually down over a few days: [screenshot]

In such situations, I usually need to know how much the metric's value changed, say, over 24h. E.g. in the screenshot above, the last value is about 607,000, and 24h ago it was about 684,000, so the delta is 684,000 - 607,000 = 77,000. I want to see this number in a Stat widget.
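To make the goal concrete: the number I'm after is roughly "the aggregated value 24h ago minus the aggregated value now". As a sketch (I haven't verified it against my data), this could probably be expressed with the offset modifier:

avg(not_processed_records_total offset 24h) - avg(not_processed_records_total)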

However, when I try this:

avg(-delta(not_processed_records_total[24h]))

I get about 3,000, which looks more like the change over the last hour: [screenshot]

I don't really understand why, and I would appreciate your advice.


To answer possible questions about my PromQL query: I use avg because the metric is reported by several workers, each of which lives for about an hour. Each worker reads the record count once and then just decrements the number every time it processes a record. So the longer a worker runs, the further its value drifts from the real count, because the other workers are processing records too. It probably smells bad and I should use a Counter instead, but I'm still interested in understanding how to solve this with the existing Gauge metric.

Also, the minus before the delta is just for convenience because in my case, the decrease is good and the increase is bad.


UPDATE: This is what the non-aggregated data looks like (the query is not_processed_records_total): [screenshot] I.e., each worker runs for an hour. It counts the number of records in the database; this is a heavy operation, so each worker does it only once and then just decrements the Gauge metric each time a record is processed. This accumulates an insignificant deviation from the actual number, because other processes can also change the number of records in the database. But as you can see in the screenshot, there are no big jumps on the line. Actually, the avg, min, and max aggregations all show about the same result (see the first screenshot in the question).

Now, when I run this:

-delta(not_processed_records_total[24h])

I get the following: [screenshot]

And if I select only one of the workers: [screenshot]

So each worker lives for an hour, and when it ends, another one starts; there are only a few minutes when two workers are running at the same time. The delta line on the chart starts from a small value at 07:00, grows to 3542 within an hour, and stays at 3542 until it starts going down at 07:00 the next day, ending up with a small value at 08:00.

So it seems like delta is computed per series, i.e. it only covers a single worker's values, and @markalex is right that if I sum all the deltas, I will get what I need:

sum(-delta(not_processed_records_total[24h]))

[screenshot]

Now the chart looks like a realistic rate of processed records per 24 hours.
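And for the original goal of a Stat widget, the same expression should work as-is (assuming the panel is set up as an instant query or reduces the series with the "Last" calculation):

sum(-delta(not_processed_records_total[24h]))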


Solution

  • The problem lies in your use of avg. Remove it to see the raw data, and go from there; most likely, for the described scenario, sum(-delta(...)) should produce something close to what you want.

    Additionally, I highly recommend considering a separate worker that exposes the actual count to Prometheus.


    Explanation of why this happens:

    Let's imagine you have two workers: one worked the whole of the last day and processed 77,000 items, and the second one didn't process any.

    Their respective values now are 607,000 and 684,000, whereas yesterday they were both at 684,000. As a result, delta returns 77,000 for one and 0 for the other, and avg returns "just" 38,500. Scale this to the actual number of workers, plus the fact that they probably process more or less the same number of items, and a result of about 3,000 becomes plausible.
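
    To make the contrast concrete, here are the two expressions from above side by side (the numbers in the comments refer to the two-worker example):

    # averages the per-worker deltas: (77000 + 0) / 2 = 38500
    avg(-delta(not_processed_records_total[24h]))

    # sums the per-worker deltas: 77000 + 0 = 77000
    sum(-delta(not_processed_records_total[24h]))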