google-cloud-platform google-compute-engine monitoring stackdriver

GCP Uptime Metric is giving unreliable alerts

I am trying to get an alert when a GCE VM is down by creating an alerting policy.

Metric: compute.googleapis.com/instance/uptime

Resource: VM instance

I configured the condition to trigger an alert when the metric is absent for 3 minutes.

To simulate this, I stopped the VM, but no alert was triggered, and no data is visible in the graph of the alerting policy.

I have attached the trigger configuration.

Solution

None of the usual metrics give reliable alerts when the VM is in the stopped state, whether compute.googleapis.com/instance/uptime, the uptime of the monitoring agent, or the CPU utilization metrics, unless you create the alerting policy with MQL (Monitoring Query Language).

"metrics associated with TERMINATED or DELETED Google Cloud resources are not considered for metric-absence policies. This means you can't use metric-absence policies to test for TERMINATED or DELETED Google Cloud VMs." https://cloud.google.com/monitoring/alerts/types-of-conditions#metric-absence

So, per the above statement, we cannot use a metric-absence policy for a stopped VM, because the VM goes into the TERMINATED state some time after it is stopped. The reason is that the instance stop time is calculated only when the instance returns to the running state.

But when you configure the same condition in MQL with the same set of metrics, the metric-absence policy works without any issues.

Sample:

Instead of configuring the condition by selecting a resource and metric, go to the Query Editor and enter the query below to get an alert when the Development environment VM has not been in the running state for 3 minutes.

fetch gce_instance
| metric 'compute.googleapis.com/instance/uptime'
| filter (metadata.user_labels.env == 'dev')
| group_by 1m, [value_uptime_aggregate: aggregate(value.uptime)]
| every 1m
| absent_for 180s
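
If you would rather keep the policy in a file than type it into the console, the same query can be wrapped in an alerting policy definition. Below is a minimal Python sketch that builds such a definition as JSON, using the conditionMonitoringQueryLanguage condition type of the Cloud Monitoring AlertPolicy resource. The function name build_policy, the display names, and the "0s" duration (chosen because the query already contains absent_for 180s) are my own assumptions for illustration, not something taken from the setup above.

```python
import json

# MQL query copied from the answer above.
MQL_QUERY = """\
fetch gce_instance
| metric 'compute.googleapis.com/instance/uptime'
| filter (metadata.user_labels.env == 'dev')
| group_by 1m, [value_uptime_aggregate: aggregate(value.uptime)]
| every 1m
| absent_for 180s
"""


def build_policy(display_name, query, notification_channels=()):
    """Return an alerting policy definition (as a dict) with one MQL condition.

    build_policy is a hypothetical helper; the field names follow the
    AlertPolicy resource of the Cloud Monitoring API.
    """
    policy = {
        "displayName": display_name,
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "Uptime metric absent for 3 minutes",
                "conditionMonitoringQueryLanguage": {
                    "query": query,
                    # Assumption: no extra wait on top of absent_for 180s.
                    "duration": "0s",
                    "trigger": {"count": 1},
                },
            }
        ],
    }
    if notification_channels:
        # Full channel resource names; none are known here, so optional.
        policy["notificationChannels"] = list(notification_channels)
    return policy


if __name__ == "__main__":
    # Print the policy as JSON so it can be saved to a file.
    print(json.dumps(build_policy("Dev VM not running", MQL_QUERY), indent=2))
```

Saving the output to a file such as policy.json (a name picked here only as an example) should let you create the policy with gcloud alpha monitoring policies create --policy-from-file=policy.json, though I have not verified this against the project described above.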

I am not sure whether this is a bug, but it is a limitation when the alerting condition is configured in the traditional way, and it can be worked around by using MQL.