I have a Gauge metric in Prometheus that shows the count of not-processed records. The count goes up and down because several jobs process the records.
In Grafana, I show it as a "Time series" widget with the query avg(not_processed_records_total).
And it looks great, but a common pattern is that the count spikes and then gradually goes down over a few days:
In such situations, I usually need to know how the metric's value changed over, say, the last 24h. E.g., on the screenshot above, the last value is about 607,000, and 24h ago it was about 684,000, so the delta is 684,000 - 607,000 = 77,000. I want to see this number in a Stat widget.
However, when I try this:
avg(-delta(not_processed_records_total[24h]))
I get about 3,000, which looks more like the change over the last hour:
I don't really understand why, and I would appreciate your advice.
To answer possible questions about my PromQL query:
I use avg because the metric is collected by several workers, and each worker lives for about 1h. Each worker gets the record count and then just decrements the number after it processes a record. So the longer a worker runs, the more its value deviates from the real count, because the other workers process records too. It probably smells bad, and I should use a Counter for it, but still, I'm interested in understanding how to solve this with the existing Gauge metric.
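Simplified, each worker handles the gauge roughly like this (a sketch in Python with prometheus_client; db, its helpers, and process() are made up for illustration):

```python
from prometheus_client import Gauge, start_http_server

not_processed = Gauge(
    'not_processed_records_total',
    'Records that are not processed yet',
)

def process(record):
    ...  # placeholder for the actual processing of one record

def run_worker(db):
    # Heavy query against the database, done once per worker lifetime (~1h)
    not_processed.set(db.count_unprocessed_records())  # hypothetical DB helper
    for record in db.fetch_unprocessed_records():      # hypothetical DB helper
        process(record)
        # Only this worker's own decrements are reflected here, so the gauge
        # slowly drifts from the real count while other workers also
        # process records.
        not_processed.dec()

if __name__ == '__main__':
    start_http_server(8000)  # expose the metric for Prometheus to scrape
```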
Also, the minus before the delta is just for convenience, because in my case a decrease is good and an increase is bad.
UPDATE:
This is what the non-aggregated data looks like (the query is not_processed_records_total):
I.e., each worker runs for an hour. It counts the number of records in the database. That is a heavy operation, so each worker does it only once and then just decrements the Gauge metric each time a record is processed. This accumulates an insignificant deviation from the actual number, because other processes can also change the number of records in the database. But as you can see on the screenshot, there are no big jumps on the line. Actually, the avg, min, and max aggregations show about the same result (see the first screenshot in the question).
Now, when I run this:
-delta(not_processed_records_total[24h])
And if I select only one of the workers:
So each worker lives for an hour, and when it ends, another one starts. There are only a few minutes when two workers operate at the same time. The delta line on the chart starts from a small value at 07:00, grows to 3542 within an hour, stays at 3542 until it starts going down at 07:00 on the next day, and ends up at a small value at 08:00.
So, it seems like delta only looks at a single worker's series, and @markalex is right that if I sum all the deltas, I will get what I need:
sum(-delta(not_processed_records_total[24h]))
Now the chart looks like a realistic rate of processed records per 24 hours.
The problem lies in your use of avg. Remove it to see the raw data, and extrapolate from there. Most likely, for the described scenario, sum(-delta(...)) should produce something close to what you want.
Additionally, I highly recommend considering a separate worker that exposes the actual, real data to Prometheus.
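Something along these lines, for example: a single exporter process that periodically reads the real backlog from the database and publishes it, instead of relying on per-worker decrements (a rough sketch in Python with prometheus_client; the metric name and db.count_unprocessed_records() are assumptions):

```python
import time
from prometheus_client import Gauge, start_http_server

# Hypothetical metric name; adjust to your naming scheme.
backlog = Gauge(
    'not_processed_records',
    'Actual number of unprocessed records, read directly from the database',
)

def main(db):
    start_http_server(8000)  # expose /metrics for Prometheus
    while True:
        # One query per interval gives the true value, independent of
        # how many workers are running or how long they live.
        backlog.set(db.count_unprocessed_records())  # hypothetical DB helper
        time.sleep(60)
```

With a single series like this, delta(not_processed_records[24h]) gives the 24h change directly, and the whole avg-vs-sum question goes away.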
Explanation of why this happens:
Let's imagine you have two workers: one worked the whole of the last day and processed 77,000 items, and the second one didn't process any. Their respective values now are 607,000 and 684,000, whereas yesterday they were both at 684,000. As a result, -delta for one will return 77,000, and 0 for the other. And avg will return "just" 38,500. Scale this to the actual number of your workers and to the fact that they probably all process more or less the same number of items, and a result of about 3,000 becomes plausible.