
Firing alerts for an activity that is supposed to happen during a particular time interval (using Prometheus metrics and Alertmanager)


I am fairly new to Prometheus Alertmanager and have a question about firing alerts only during a particular period.

I have a microservice that receives a file and does some processing on it. It is only invoked when it gets a message through a Kafka queue. The file is supposed to arrive every day between 5 am and 6 am (UTC). The microservice has a metric that is incremented by 1 every time it receives a file. I want to raise an alert if it does not receive a file in that interval. I have created a query like this:

    expr : sum(increase(metric_name[1m]) and on() hour(vector(time()))==5) < 1
    for: 1h

My questions:
1) Is this correct, or is there a better way to do it?
2) If there is no update, will it return 0 or "datapoints not found"?
3) Is increase the correct function? It tends to give decimal results due to extrapolation, but I understand that if the increase is 0, it will show 0.

I can't really change scrape_interval, which is set to 30s.


Solution

  • I have not run this expression, but I expect it will cause the alert to fire only at 06:00 and then resolve at 06:01. That is the only time at which the expression would have held true for one hour.

    Answering your questions

    1. It is correct if what you want is an alert that fires once (sending an email, for example) and then stops firing. Even then, the schedule is a bit tight, and an Alertmanager delay may cause the alert to be lost.
    2. If there is no increase, the expression evaluates to 0. It returns an empty result when there is an update.
    3. increase is the right function. It even accounts for counter resets.

    Answering if there is a better way to do it.

    Regarding your expression, you can get the same result, without a for clause, with:

    expr: increase(metric_name[1h])==0 and on() hour()==6 and on() minute()<1
    

    It reads as: starting at 6 am and for 1 minute, alert if there was no increase of the metric over the last hour.
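
    For context, here is a minimal sketch of how that expression could sit in a Prometheus rule file. The group name, alert name, label, and annotation text are hypothetical placeholders, not taken from the question; only the expr comes from above.

    ```yaml
    groups:
      - name: file-arrival            # hypothetical group name
        rules:
          - alert: FileNotReceived    # hypothetical alert name
            # True only during the first minute of the 06:00 UTC hour,
            # and only if the counter did not increase over the last hour.
            expr: increase(metric_name[1h]) == 0 and on() hour() == 6 and on() minute() < 1
            labels:
              severity: warning       # assumed label, adjust to your routing
            annotations:
              summary: "No file received between 05:00 and 06:00 UTC"
    ```

    Since there is no for clause, the alert fires on the first evaluation where the expression returns a result and resolves as soon as the expression returns nothing again.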

    Alerting longer

    If you want the alert to last longer (say, for the rest of the day, silencing it once the problem is solved), you can use subqueries:

    expr: increase((metric and on() hour()==5)[18h:])==0 and on() hour()>5
    

    It reads as: starting at 6 am (hour()>5), compute the increase over 5-6 am for the next 18 hours. If you would like the alert to go through a pending state first, you can drop the trailing on() hour()>5 and use a for: 1h clause.
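
    As a sketch of the pending variant just described, the rule might look like the following. The group name, alert name, label, and annotation are hypothetical, and metric stands for the counter from the question.

    ```yaml
    groups:
      - name: file-arrival                 # hypothetical group name
        rules:
          - alert: FileNotReceivedToday    # hypothetical alert name
            # The subquery keeps only samples taken during the 05:00 UTC hour,
            # so increase() measures what happened between 5 am and 6 am.
            expr: increase((metric and on() hour() == 5)[18h:]) == 0
            # Stay pending for one hour before firing, replacing the
            # trailing "and on() hour() > 5" guard.
            for: 1h
            labels:
              severity: warning            # assumed label
            annotations:
              summary: "No file received during the 05:00-06:00 UTC window"
    ```

    The subquery [18h:] omits an explicit resolution, so it is evaluated at the default evaluation interval.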

    If you want to keep alerting until a file is submitted, and thus detect a resolution, simply change the expression to evaluate the increase up to now:

    expr: increase((metric and on() hour()>5)[18h:])==0 and on() hour()>5