Tags: apache-kafka, prometheus, grafana, metrics, grafana-alerts

Detecting Only Increasing Prometheus Metrics Over A Given Interval


Recently, I've been getting quite a few false positives from an existing Prometheus-based alert that has been difficult to nail down (even though it seemingly should be simple), so I thought I'd ask whether there's something obviously wrong with the query or the thinking behind it.

I have a Kafka consumer that reads from several different Kafka topics, and an associated metric, kafka_consumergroup_lag, that reports the lag for the consumer group itself. The metric exposes the following two important labels:

  • consumergroup - The name of the consumer group
  • topic - The name of the topic being consumed from
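
For illustration, the raw series look something like this (topic names and values here are hypothetical, and depending on the exporter there may also be a partition label, which is one reason the queries below aggregate with sum ... by (consumergroup, topic)):

    kafka_consumergroup_lag{consumergroup="my-consumer-group",topic="orders",partition="0"} 42
    kafka_consumergroup_lag{consumergroup="my-consumer-group",topic="orders",partition="1"} 17
    kafka_consumergroup_lag{consumergroup="my-consumer-group",topic="payments",partition="0"} 3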

The typical pattern of lag (as seen via sum(kafka_consumergroup_lag) by (consumergroup, topic)) looks something like this:

[Graph: consumer lag per consumergroup/topic fluctuating up and down over time]

However, if the lag began growing continually, without any decreases for a period of time, that could indicate a much larger problem.

How can I construct a query that detects only increases for a given consumergroup-topic combination over the course of an hour? I'd like to use it in a Grafana-based alert that fires when the condition is met (e.g. "Consumer lag for $topic in $consumergroup has continually increased for an hour").

At present, I'm using something like the following, which I thought would be sufficient, but I'm still seeing false positives:

sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) by (consumergroup, topic) >= 3600

This has an associated alerting condition of:

  • EVALUATE every 1m for 5m (since if it's already been increasing, we'd want to alert after 5 minutes)
  • WHEN last() of query(queryName, 5m, now) is above 0
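
For reference, here's roughly the same intent expressed as a Prometheus-style alerting rule (a sketch only; my actual setup is the Grafana condition above):

    groups:
      - name: kafka-consumer-lag
        rules:
          - alert: ConsumerLagIncreasing
            expr: >-
              sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h]))
              by (consumergroup, topic) >= 3600
            for: 5m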

This seems like a fairly easy thing to detect, but it has been quite a challenge to get right.


Solution

For those who encounter this in the future: I solved the issue by simply tracking the delta against the previous minute, paired with an alerting threshold that detects a sustained positive value. (In hindsight, the false positives most likely came from applying increase(), which assumes counter semantics, to a gauge; whenever the lag dipped, increase() treated the drop as a counter reset and inflated the result.)

    sum(
       kafka_consumergroup_lag{consumergroup="my-consumer-group"} - 
       kafka_consumergroup_lag{consumergroup="my-consumer-group"} offset 1m
    ) by (consumergroup, topic)
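
One property of this form worth noting: the subtraction matches series on identical label sets, so each series is compared against its own value from one minute earlier, and any series without a sample a minute ago simply drops out of the result.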
    

I'd imagine this could also be accomplished with the delta() function:

    sum(
       delta(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1m])
    ) by (consumergroup, topic)
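
(One caveat: as far as I know, delta() needs at least two samples inside its range window, so a [1m] range assumes a scrape interval comfortably shorter than a minute; with slower scrapes a wider window such as [2m] would be safer.)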
    

The result functions as a "trend": a positive value indicates an increase, and a negative value a decrease. The alert then evaluates the query every minute and only fires once the value has remained positive for n consecutive minutes.
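
As a closing sketch, here's how that could look as a Prometheus alerting rule using the delta() form above; the rule name, the 15-minute duration, and the annotation text are all illustrative, and in Grafana the same effect comes from the evaluation interval combined with the "for" duration:

    groups:
      - name: kafka-consumer-lag
        rules:
          - alert: ConsumerLagContinuallyIncreasing
            # Fires only when the minute-over-minute lag trend stays
            # positive for 15 consecutive minutes of evaluations.
            expr: >-
              sum(delta(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1m]))
              by (consumergroup, topic) > 0
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: >-
                Consumer lag for {{ $labels.topic }} in
                {{ $labels.consumergroup }} has continually increased
                for 15 minutes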