We have an alert on Kafka message lag. Throughput for this message generally stays very low, but on the last day of the month it is expected to be high. I want to set up correct alerting for this case.
The current query is similar to this:
avg by (topic_name) (message_received_delay_histogram_all_p95_value{actor!="", environment="production", application="foo"}) > 25000.000000
On the last day of the month, the lag can be around 500K.
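You can add
and on() day_of_month() != days_in_month()
to evaluate the alert on every day except the last day of the month. Do not use the bool modifier here: with bool the comparison always returns a sample (0 or 1), so the and would never filter anything out. Then use or to add the modified alert expression for the last day of the month.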
Your alert expression would be something like this:
(avg by (topic_name) (message_received_delay_histogram_all_p95_value{actor!="", environment="production", application="foo"}) > 25000.0
and on() day_of_month() != days_in_month())
or
(avg by (topic_name) (message_received_delay_histogram_all_p95_value{actor!="", environment="production", application="foo"}) > 500000.0
and on() day_of_month() == days_in_month())
The part
and on() day_of_month() == days_in_month()
is optional, because a value above 500000 on any other day already matches the first branch. It is included only to make the intent explicit.
For the sake of visibility and readability, though, I advise splitting the two parts of the or expression into two separate alerts.
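As a rough sketch of that split, the rule file might look like the following. The group name, alert names, for duration, and severity labels are hypothetical placeholders, not taken from the question. Only the two expressions come from the answer above. Once the alerts are separate, the == days_in_month() condition is no longer optional: without it the month-end alert would also fire on ordinary days alongside the first one.

```yaml
groups:
  - name: kafka-message-lag            # hypothetical group name
    rules:
      # Regular threshold, evaluated on every day except the last day of the month.
      - alert: KafkaMessageLagHigh     # hypothetical alert name
        expr: |
          avg by (topic_name) (
            message_received_delay_histogram_all_p95_value{actor!="", environment="production", application="foo"}
          ) > 25000
          and on() day_of_month() != days_in_month()
        for: 5m                        # assumed duration, tune to your needs
        labels:
          severity: warning            # assumed label

      # Relaxed threshold, evaluated only on the last day of the month.
      - alert: KafkaMessageLagHighMonthEnd   # hypothetical alert name
        expr: |
          avg by (topic_name) (
            message_received_delay_histogram_all_p95_value{actor!="", environment="production", application="foo"}
          ) > 500000
          and on() day_of_month() == days_in_month()
        for: 5m                        # assumed duration
        labels:
          severity: warning            # assumed label
```

Keeping the two rules mutually exclusive by date means each alert name maps to exactly one threshold, which makes notifications easier to interpret.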
EDIT: some additional useful time-based restrictions:
The same, but excluding the first day of the month instead of the last:
and on() day_of_month() != 1
Stop the alert from firing on the first day of every month between 09:00 and 18:00 (hour() is evaluated in UTC):
unless on() (day_of_month() == 1 and hour() >= 9 < 18)
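For illustration, here is how that suppression window could be attached to the original query from the question. This is only a sketch: the group name, alert name, and for duration are hypothetical. The threshold is the 25000 from the question.

```yaml
groups:
  - name: kafka-message-lag-quiet-window   # hypothetical group name
    rules:
      - alert: KafkaMessageLagHigh         # hypothetical alert name
        expr: |
          avg by (topic_name) (
            message_received_delay_histogram_all_p95_value{actor!="", environment="production", application="foo"}
          ) > 25000
          # Drop every result while the window is active:
          # first day of the month, hours 9 through 17 inclusive (UTC).
          unless on() (day_of_month() == 1 and hour() >= 9 < 18)
        for: 5m                            # assumed duration
```

The on() with an empty label list is what lets the label-less time expression match every topic_name series on the left-hand side.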