Tags: apache-kafka, prometheus, grafana, metrics, grafana-alerts

Detecting Only Increasing Prometheus Metrics Over A Given Interval


Recently, I've been getting quite a few false positives from an existing Prometheus-based alert that has been difficult to nail down (even though it seemingly should be simple), so I thought I'd ask whether there's something obviously wrong with the query or the thinking behind it.

I have a Kafka consumer that reads from several different Kafka topics, and an associated metric, kafka_consumergroup_lag, that reports the lag for the consumer group itself. The metric exposes the following two important labels:

  • consumergroup - The name of the consumer group
  • topic - The name of the topic being consumed from
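
For illustration, the raw series look something like this (topic names and values here are hypothetical, and depending on the exporter there may also be a partition label, which is one reason the queries below aggregate with sum ... by (consumergroup, topic)):

    kafka_consumergroup_lag{consumergroup="my-consumer-group",topic="orders",partition="0"} 42
    kafka_consumergroup_lag{consumergroup="my-consumer-group",topic="orders",partition="1"} 17
    kafka_consumergroup_lag{consumergroup="my-consumer-group",topic="payments",partition="0"} 3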

The typical pattern of lag (as seen via sum(kafka_consumergroup_lag) by (consumergroup, topic)) looks something like this:

[Graph: consumer lag per consumergroup/topic fluctuating up and down over time]

However, if the lag began growing continually, without any decreases for a period of time, that could indicate a much larger problem.

How can I construct a query that detects only increases for a given consumergroup-topic combination over the course of an hour? I'd like to use it in a Grafana-based alert that fires when the condition is met (e.g. "Consumer lag for $topic in $consumergroup has continually increased for an hour").

At present, I'm using something like the following, which I thought would be sufficient, but I'm still seeing false positives:

sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) by (consumergroup, topic) >= 3600

This has an associated alerting condition of:

  • EVALUATE every 1m for 5m (since if it's already been increasing, we'd want to alert after 5 minutes)
  • WHEN last() of query(queryName, 5m, now) is above 0
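
For reference, here's roughly the same intent expressed as a Prometheus-style alerting rule (a sketch only; my actual setup is the Grafana condition above):

    groups:
      - name: kafka-consumer-lag
        rules:
          - alert: ConsumerLagIncreasing
            expr: >-
              sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h]))
              by (consumergroup, topic) >= 3600
            for: 5m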

This seems like a fairly easy thing to detect, but it has been quite a challenge to get right.


Solution

For those who encounter this in the future: I solved the issue by simply tracking the delta against the previous minute, paired with an alerting threshold that detects a sustained positive value. (In hindsight, the false positives most likely came from applying increase(), which assumes counter semantics, to a gauge; whenever the lag dipped, increase() treated the drop as a counter reset and inflated the result.)

    sum(
       kafka_consumergroup_lag{consumergroup="my-consumer-group"} - 
       kafka_consumergroup_lag{consumergroup="my-consumer-group"} offset 1m
    ) by (consumergroup, topic)
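
One property of this form worth noting: the subtraction matches series on identical label sets, so each series is compared against its own value from one minute earlier, and any series without a sample a minute ago simply drops out of the result.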
    

I'd imagine this could also be accomplished with the delta() function:

    sum(
       delta(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1m])
    ) by (consumergroup, topic)
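
(One caveat: as far as I know, delta() needs at least two samples inside its range window, so a [1m] range assumes a scrape interval comfortably shorter than a minute; with slower scrapes a wider window such as [2m] would be safer.)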
    

The result functions as a "trend": a positive value indicates an increase, and a negative value a decrease. The alert then evaluates the query every minute and only fires once the value has remained positive for n consecutive minutes.
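
As a closing sketch, here's how that could look as a Prometheus alerting rule using the delta() form above; the rule name, the 15-minute duration, and the annotation text are all illustrative, and in Grafana the same effect comes from the evaluation interval combined with the "for" duration:

    groups:
      - name: kafka-consumer-lag
        rules:
          - alert: ConsumerLagContinuallyIncreasing
            # Fires only when the minute-over-minute lag trend stays
            # positive for 15 consecutive minutes of evaluations.
            expr: >-
              sum(delta(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1m]))
              by (consumergroup, topic) > 0
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: >-
                Consumer lag for {{ $labels.topic }} in
                {{ $labels.consumergroup }} has continually increased
                for 15 minutes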