Tags: stackdriver, google-cloud-run

Resolve Stackdriver incident when no more time series with available data violate the policy


I have Stackdriver alerts/incidents on metrics like Cloud Run revision request latencies.

If a few requests a long time ago had high latency, and no new low-latency requests have come in since then, the incident keeps firing permanently. This is because when no new requests come in, there are no new data points for the metric.

Is there a way to automatically stop an incident from firing when there are no recent data points for the underlying metric? Or is there an alternative way to set up alerts on high request latencies in Cloud Run that automatically resolve when no new high-latency requests are coming in?


Solution

  • Edit: This solution will not work because the request count metric simply stops being sent to Stackdriver instead of dropping to zero. As explained in the other (more correct) answer, the solution is to create a logs-based metric for the requests; that metric properly drops to zero when there are no additional requests (a rough sketch of that approach follows below).
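
    As a rough illustration of that edit (the other answer is not reproduced here), creating such a logs-based metric with the google-cloud-logging Python client could look like the sketch below. The project ID, metric name, and log filter are assumptions for illustration, not values from the original answer.

        from google.cloud import logging

        client = logging.Client(project="my-project")  # assumed project ID

        # Match Cloud Run request log entries. The edit above notes that a
        # logs-based metric over these entries drops to zero when no new
        # requests arrive, unlike run.googleapis.com/request_count, which
        # simply stops being written.
        request_filter = (
            'resource.type="cloud_run_revision" '
            'AND logName:"run.googleapis.com%2Frequests"'
        )

        metric = client.metric(
            "cloud_run_request_count",  # illustrative metric name
            filter_=request_filter,
            description="Counts Cloud Run request log entries",
        )

        if not metric.exists():
            metric.create()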


    This behaviour is documented in the alerting docs:

    If measurements are missing (for example, if there are no HTTP requests for a couple of minutes), the policy uses the last recorded value to evaluate conditions.

    There are a few recommendations in there to mitigate this issue, but they all assume you are still collecting metrics, which is not your situation: there are no metrics at all because you stopped receiving requests.

    This is probably by design: even if you are not receiving additional requests, you might still want to check why all the latest requests had this increased latency.

    To work around this feature, you could try to use multiple conditions in your alert policy:

    • One condition related to the latency: if latency > X
    • One condition related to the existence of requests: if request count > 1

    If you combine those with AND_WITH_MATCHING_RESOURCE, the policy should only trigger when there is high latency and there are requests. The incident should be resolved as soon as either of the two conditions is no longer met: even if no new latency metrics are ingested (so the alerting policy still thinks the latency is high), the request-count condition will stop matching after the specified duration (see the sketch below the screenshot).

    [Screenshot: alert config]
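
    The original screenshot is not reproduced here, so the following is only a rough sketch of what such a two-condition policy could look like with the google-cloud-monitoring Python client; the project ID, thresholds, durations, and aggregations are illustrative assumptions. Note that, per the edit above, the built-in request_count metric also stops being written when there is no traffic, so in practice the second condition would need to point at the logs-based metric instead.

        from google.cloud import monitoring_v3
        from google.protobuf import duration_pb2

        client = monitoring_v3.AlertPolicyServiceClient()
        project_name = "projects/my-project"  # assumed project ID

        five_minutes = duration_pb2.Duration(seconds=300)

        # Condition 1: p99 request latency is above an (assumed) 1000 ms threshold.
        latency_condition = monitoring_v3.AlertPolicy.Condition(
            display_name="Request latency above threshold",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type="run.googleapis.com/request_latencies" '
                    'AND resource.type="cloud_run_revision"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=1000,
                duration=five_minutes,
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period=five_minutes,
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_PERCENTILE_99,
                    )
                ],
            ),
        )

        # Condition 2: requests are still arriving. This condition stops matching
        # (and should let the incident resolve) once traffic stops.
        request_count_condition = monitoring_v3.AlertPolicy.Condition(
            display_name="Requests are still arriving",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type="run.googleapis.com/request_count" '
                    'AND resource.type="cloud_run_revision"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=1,
                duration=five_minutes,
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period=five_minutes,
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
                    )
                ],
            ),
        )

        # Both conditions must fire on the same resource for an incident to open.
        policy = monitoring_v3.AlertPolicy(
            display_name="Cloud Run high latency (only while receiving requests)",
            combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND_WITH_MATCHING_RESOURCE,
            conditions=[latency_condition, request_count_condition],
        )

        created = client.create_alert_policy(name=project_name, alert_policy=policy)
        print("Created policy:", created.name)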