google-cloud-platform, google-cloud-monitoring

Google Monitoring AlertPolicies notification spam due to threshold duration


Working with GCP Monitoring, I want to set up an alert based on a GCP Uptime Check metric. My alert uses a threshold of above 1 with a duration of 1 min.

My problem is that I am getting spammed with notifications because of the short duration when the time series is spiky. But I do want to keep a short duration so that I get the first notification quickly.

For example, in the following image:

I get a first alert notification at 8:21 (after 1 min over the threshold). Great! But then I get a resolved notification at 8:22, a new alert notification at 8:23, and finally a resolved notification at 8:28.

So I am getting 4 notifications when I would like to receive only 2. What I am missing is an option to set one threshold duration to fire an alert and another threshold duration to resolve the incident. In my case, I would set a 1 min duration to fire and a 10 min duration to resolve.

[Image: GCP Uptime Check chart]

Can someone help with this issue?

Thank you for your help!


Solution

  • I don't think you can achieve what you want with a GCP alerting policy.

    For a better understanding, I suggest reading the "Alerting behavior" documentation.

    In simple terms, the Period in a GCP alerting policy is described as follows:

    The alignment period is a look-back interval from a particular point in time. For example, when the alignment period is five minutes, at 1:00 PM, the alignment period contains the samples received between 12:55 PM and 1:00 PM. At 1:01 PM, the alignment period slides one minute and contains the samples received between 12:56 PM and 1:01 PM.
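As a small illustration of the sliding look-back interval quoted above, here is a minimal Python sketch. The function name `alignment_window` and the calendar date are made up for the example; this is not part of any Cloud Monitoring API.

```python
from datetime import datetime, timedelta

def alignment_window(now, period):
    """Return the (start, end) look-back interval that ends at `now`.

    Hypothetical helper for illustration only.
    """
    return now - period, now

period = timedelta(minutes=5)

# The date is arbitrary; only the time of day matters here.
for now in (datetime(2024, 1, 1, 13, 0), datetime(2024, 1, 1, 13, 1)):
    start, end = alignment_window(now, period)
    print(start.time(), "to", end.time())
```

At 1:00 PM this prints the 12:55 to 13:00 window, and one minute later the window has slid to 12:56 to 13:01, matching the quoted description.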

    The Duration is how long the value must stay above the threshold before the condition is met. However, there is another important point:

    A condition resets its duration window each time a measurement doesn't satisfy the condition.

    If I understand your comments correctly, you want to get an alert notification at 8:21 and a resolved notification at 8:28.

    However, you set the Period to 1 minute and the Duration to 1 minute as well. The policy algorithm treats this as two separate incidents because each one fulfills all the conditions: in both cases, the value stayed above the threshold for at least one full minute.

    • 1st incident: ~08:19:40 AM to ~08:21:05 AM. The value was above the threshold for longer than 1 minute (~1 min 25 s).
    • 2nd incident: ~08:21:50 AM to 08:27:30 AM. Here the value was also above the threshold for longer than 1 minute (~5 min 40 s).

    So both alerts are expected with your configuration. This is described in "Incidents for metric-based alerts":

    An incident is a record of the triggering of an alerting policy. Cloud Monitoring opens an incident when a condition of an alerting policy has been met. The incident contains information you can use to investigate the cause of the alert.

    The feature you are asking for might be implemented in some third-party monitoring tools, but not here.

    The only thing that comes to mind is to change the duration/period to reduce false positives.
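To see why a 1 minute duration yields two incidents here, and what a longer duration would change, the following Python sketch models the behavior in a deliberately simplified way. It assumes an incident opens once the value has stayed above the threshold for the full duration and closes as soon as the value drops below it. It ignores alignment and evaluation delays, so the times it prints will not match the real notification times exactly. The helper names `t` and `incidents` are made up for this example.

```python
from datetime import datetime, timedelta

def t(hms):
    """Parse 'HH:MM:SS' on an arbitrary fixed date (the date is irrelevant)."""
    return datetime.strptime("2024-01-01 " + hms, "%Y-%m-%d %H:%M:%S")

# Approximate intervals during which the value was above the threshold,
# taken from the two incidents listed in this answer.
ABOVE = [
    (t("08:19:40"), t("08:21:05")),
    (t("08:21:50"), t("08:27:30")),
]

def incidents(above_intervals, duration):
    """Simplified model, not the real Cloud Monitoring algorithm.

    Each drop below the threshold resets the duration window, so every
    above-threshold interval is judged on its own. An interval opens an
    incident only if it lasts at least `duration`.
    """
    result = []
    for start, end in above_intervals:
        if end - start >= duration:
            result.append((start + duration, end))  # (opened, closed)
    return result

for minutes in (1, 2):
    found = incidents(ABOVE, timedelta(minutes=minutes))
    print(f"duration={minutes}min: {len(found)} incident(s), "
          f"{2 * len(found)} notification(s)")
    for opened, closed in found:
        print("  opened", opened.time(), "closed", closed.time())
```

Under this model, a 1 minute duration gives two incidents (four notifications), while a 2 minute duration skips the first ~1 min 25 s spike and gives a single incident. The cost is that the first notification arrives later, which is exactly the trade-off described in the question.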

    There is a good video on YouTube that explains alerting policies; the relevant part starts at 5:39.

    In general:

    • Alerts are generated each time the condition(s) are met, as mentioned in "Introduction to alerting".
    • You cannot have one alert covering two separate events. Each event triggers its own alert.