
GCP Uptime Metric is giving unreliable alerts


I am trying to get an alert when a GCE VM is down by creating an alerting policy.

Metric: compute.googleapis.com/instance/uptime

Resource: VM instance

I configured the condition to trigger an alert when this metric is absent for 3 minutes.

To simulate this behavior, I stopped the VM, but no alert is triggered, and no data is visible in the graph of the alerting policy.

I have attached the trigger configuration:

[Screenshot of the trigger configuration]


Solution

  • None of the metrics (compute.googleapis.com/instance/uptime, the uptime of the monitoring agent, or CPU utilization) give reliable alerts when the VM is in the stopped state, unless you create the alerting policy with MQL (Monitoring Query Language).

    "metrics associated with TERMINATED or DELETED Google Cloud resources are not considered for metric-absence policies. This means you can't use metric-absence policies to test for TERMINATED or DELETED Google Cloud VMs." https://cloud.google.com/monitoring/alerts/types-of-conditions#metric-absence

    So, as per the above statement, we cannot use a metric-absence policy for a stopped VM, because the VM goes to the TERMINATED state some time after it is stopped. The reason is that the instance stop time is calculated only when the instance returns to the running state.

    But when you configure the same condition in MQL with the same set of metrics, the metric-absence policy works without any issues.

    Sample:

    Instead of configuring the condition by selecting a resource and metric, go to the Query Editor and enter the query below to get an alert when the development environment VM has not been in the running state for 3 minutes.

    fetch gce_instance
    | metric 'compute.googleapis.com/instance/uptime'
    | filter (metadata.user_labels.env == 'dev')
    | group_by 1m, [value_uptime_aggregate: aggregate(value.uptime)]
    | every 1m
    | absent_for 180s
    

    I am not sure whether this is a bug, but it is a limitation when the alerting condition is configured the traditional way, and it can be worked around by using MQL.
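
If you prefer to keep the policy in a file instead of typing the query into the Query Editor, the MQL query above can be wrapped in an alert policy definition. The following is only a minimal sketch: the function name `build_policy` and the display name are hypothetical, and the `"0s"` duration and trigger count of 1 are assumptions (the 3-minute window is already expressed by `absent_for 180s` in the query). The field names follow the `conditionMonitoringQueryLanguage` condition of the Cloud Monitoring AlertPolicy resource as I understand it, so check them against the current API reference before relying on this.

```python
import json

# MQL query from the answer above; the 'env' == 'dev' label is the answer's example.
MQL_QUERY = "\n".join([
    "fetch gce_instance",
    "| metric 'compute.googleapis.com/instance/uptime'",
    "| filter (metadata.user_labels.env == 'dev')",
    "| group_by 1m, [value_uptime_aggregate: aggregate(value.uptime)]",
    "| every 1m",
    "| absent_for 180s",
])


def build_policy(query, display_name="Dev VM not running"):
    """Return an alert policy body (plain dict) with one MQL condition.

    display_name is a hypothetical label chosen for this sketch.
    """
    return {
        "displayName": display_name,
        "combiner": "OR",
        "conditions": [
            {
                "displayName": display_name,
                "conditionMonitoringQueryLanguage": {
                    "query": query,
                    # Assumption: no extra duration, since absent_for
                    # in the query already covers the 180s window.
                    "duration": "0s",
                    # Assumption: alert as soon as one time series matches.
                    "trigger": {"count": 1},
                },
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON so it can be saved to a file.
    print(json.dumps(build_policy(MQL_QUERY), indent=2))
```

The printed JSON can be saved to a file such as policy.json and, assuming the gcloud alpha component is available, passed to `gcloud alpha monitoring policies create --policy-from-file=policy.json`. Notification channels are left out of this sketch and would need to be added separately.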