Recently, I've been seeing quite a few false positives from an existing Prometheus-based alert. It seems like it should be simple, but the cause has been difficult to nail down, so I thought I'd ask whether there is something obviously wrong with the query or the thought process behind it.
I have a Kafka consumer that handles reading from several different Kafka topics and an associated metric, kafka_consumergroup_lag, that stores a value for the expected lag for the consumer group itself. The metric exposes the following two important labels:
consumergroup - The name of the consumer group
topic - The name of the topic being consumed from
The typical pattern of lag (via a sum(kafka_consumergroup_lag) by (consumergroup, topic)) looks something like this:
However, if it began continually growing (without any decreases for a period of time), this could indicate a much larger problem.
How can I construct a query to only detect increases for a given consumergroup-topic combination over the period of an hour? I'd like to use this metric in conjunction with a Grafana-based alert to fire when this threshold is met (e.g. "Consumer Lag for $topic in $consumergroup has continually increased for an hour")
At present, I'm using something like the following which I thought would be sufficient, but I'm still seeing false positives being reported:
sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) by (consumergroup, topic) >= 3600
This has an associated alerting condition of:
Evaluate every 1m for 5m (since if it's already been increasing, we'd want to alert after 5 minutes)
When last() of query(queryName, 5m, now) is above 0
This seems like a fairly easy thing to detect, but it has been quite a challenge to get it correct.
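One thing worth checking in the query above: increase() is intended for counters, and it treats any drop between two consecutive samples as a counter reset that it then compensates for. Lag is a gauge that rises and falls, so those compensations can make the hourly "increase" large even when the lag ends the hour roughly where it started. As a diagnostic sketch (reusing the metric and label values from the question, not a fix in itself), resets() reports how many such drops fall inside the window:

```promql
# Number of sample-to-sample decreases in the lag over the past hour.
# increase() interprets each of these as a counter reset.
resets(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])
```

If this returns anything other than 0 for a series whose lag looks healthy, the increase()-based query is likely counting ordinary drops in lag as resets, which would be consistent with the false positives.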
For those who encounter this in the future, I solved the issue by tracking the delta against the value from one minute earlier and using an alerting threshold that detects a positive value for a given period of time:
sum(
  kafka_consumergroup_lag{consumergroup="my-consumer-group"} -
  kafka_consumergroup_lag{consumergroup="my-consumer-group"} offset 1m
) by (consumergroup, topic)
I'd imagine this could also be accomplished with the delta() function:
sum(
  delta(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1m])
) by (consumergroup, topic)
It would function as a "trend", where a positive value indicates an increase and a negative value indicates a decrease. With the alert evaluating every minute for n minutes, it fires only if the value stays positive for that entire period.
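The "positive for n minutes" check could also be folded into the query itself with a subquery, so the expression only returns a series when every one-minute step in the window showed growth. This is a minimal sketch, not something I have run against this setup: the 1h window and 1m step are assumptions taken from the question, and it relies on the lag metric being scraped at least once per minute.

```promql
# Smallest one-minute change in lag seen over the past hour,
# re-evaluated at one-minute steps via a subquery.
# A result above 0 means every step in the hour showed an increase.
min_over_time(
  (
    sum by (consumergroup, topic) (
      kafka_consumergroup_lag{consumergroup="my-consumer-group"}
      - kafka_consumergroup_lag{consumergroup="my-consumer-group"} offset 1m
    )
  )[1h:1m]
) > 0
```

Because the comparison is strict, a single minute where the lag stays flat or drops resets the condition. Whether that is too sensitive depends on how noisy the lag is for a given topic.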