google-cloud-platform alert metrics

GCP Alerting - how to set uptime threshold?


We are trying to set up GKE uptime alerts for our Kubernetes cluster using the new (i.e. non-legacy) GCP dashboard. Similar alerts work fine for CPU/memory utilization, but something about uptime is strange.

The policy shown below applies to our prod cluster, sets a rolling window of 10 minutes, and applies count as the aggregator function. The count is the number of uptime minutes in the window. On the right, I have turned off a service. You can see that this causes the value to step down gradually from 10 (fully up) to 0 (fully down) over the course of ten minutes. When the metric hits 0 (i.e. falls below the threshold value of 1), we should get an alert.


However, rather than going to 0, the count shows as "-" in the UI. I assume this means null or something similar? As you can see in the step function, the steps go 10, 9, 8, ..., 3, 2, 1 and then disappear without ever reaching 0. Why would the count go to "-" in the UI instead of 0? The metric description itself lists it as a Double, so I would expect it to go to 0.
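The stepping pattern described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the container writes one uptime sample per minute, and an empty window produces no point at all (modeled here as `None`) rather than a count of 0. The function and variable names are hypothetical and are not part of any GCP API.

```python
from typing import List, Optional


def rolling_count(sample_minutes: List[int], now: int, window: int = 10) -> Optional[int]:
    """Count samples with timestamps in (now - window, now].

    Assumption: an empty window yields no output point, modeled as None.
    """
    n = sum(1 for t in sample_minutes if now - window < t <= now)
    return n if n > 0 else None


# The container reports one sample per minute for minutes 1..20, then stops.
samples = list(range(1, 21))

# Evaluate the 10-minute rolling count from minute 20 to minute 31.
series = {now: rolling_count(samples, now) for now in range(20, 32)}
for now, value in series.items():
    print(now, "-" if value is None else value)

# A "below 1" threshold is only compared against points that exist,
# so in this model it never fires.
fired = [now for now, v in series.items() if v is not None and v < 1]
```

In this model the count steps from 10 down to 1 and then has no value, so a threshold of "below 1" has nothing to compare against.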


Do we have the wrong metric, or should we just set the threshold to below 2 instead as a workaround? Or perhaps I should choose "Metric Absence" as the best way to track this?


Solution

  • I reproduced the issue and used the Metric Absence condition, which triggered an alert when the metric had no data for a specified duration. Uptime data is generated only while the container is up and running, so a stopped container produces no data points rather than a count of 0. An alternative is to use custom metrics and write time series data to them yourself.

    EDIT

    A metric absence condition triggers if any time series in the metric has no data for a specific duration window. Except for metrics generated by an uptime check, metrics associated with TERMINATED or DELETED Google Cloud resources (VMs) are not considered for metric-absence policies.

    To create an alert policy with the Metric Absence condition type, select the metric (Kubernetes Container - Uptime) and adjust the length of the rolling window. Set the rolling window function to count, then set the condition type to Metric Absence under the Configure alert trigger option.

    Set the alert trigger to Any time series violates and the trigger absence time to 5 min, then create the alert policy after selecting your preferred notification channel type. If the container then reports no data, the metric absence condition triggers and sends an alert message to that notification channel.
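The console steps above roughly correspond to an alert policy definition with a `conditionAbsent` block. The following is a minimal sketch that builds such a definition as a Python dictionary and prints it as JSON. The field names follow the Cloud Monitoring AlertPolicy resource as I understand it, but the display names, the `cluster_name` label filter, and the value `prod-cluster` are assumptions for illustration, so verify the result against the current API reference before using it.

```python
import json


def build_absence_policy(cluster_name: str) -> dict:
    """Build a metric-absence alert policy for container uptime (illustrative)."""
    metric_filter = (
        'resource.type = "k8s_container" '
        'AND metric.type = "kubernetes.io/container/uptime" '
        # Assumption: restrict the policy to one cluster by its label.
        f'AND resource.labels.cluster_name = "{cluster_name}"'
    )
    return {
        "displayName": "Container uptime absent",  # hypothetical name
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "No uptime data for 5 minutes",  # hypothetical name
                "conditionAbsent": {
                    "filter": metric_filter,
                    # Trigger absence time of 5 minutes.
                    "duration": "300s",
                    # 10-minute rolling window with count as the function.
                    "aggregations": [
                        {
                            "alignmentPeriod": "600s",
                            "perSeriesAligner": "ALIGN_COUNT",
                        }
                    ],
                    # "Any time series violates": one series is enough.
                    "trigger": {"count": 1},
                },
            }
        ],
        # Add the resource names of your notification channels here.
        "notificationChannels": [],
    }


policy = build_absence_policy("prod-cluster")  # hypothetical cluster name
print(json.dumps(policy, indent=2))
```

The printed JSON could be saved to a file and submitted through the Cloud Monitoring API or the gcloud CLI once notification channels are filled in.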