google-cloud-platform google-cloud-monitoring

GCP Alerting Policy using MQL


I need to create an alerting policy based on error percentage and then assign a priority to the incident based on that percentage.

Is there a way to use multiple conditions in this query, so that | condition val() > 1'%' means P3, | condition val() > 5'%' means P2, and | condition val() > 10'%' means P1, and to apply them dynamically in the alert policy condition so that people are notified with the relevant priority?

Below is my query:

fetch l7_lb_rule :: logging.googleapis.com/user/apis
| filter metric.URL=~'https://myapi.com/users' 
| {
    filter metric.URL=~'https://myapi.com/users' && metric.status >= 500
  ;
    ident
  }
| group_by [metric.URL]
| ratio
| scale '%'
| every (5m)
| window (5m)
| condition val() > 10'%'

Solution

  • Here is an example query that p13rr0m wrote in MQL, for reference:

    fetch api
    | metric 'serviceruntime.googleapis.com/api/request_count'
    | filter (resource.service == 'my-service.com')
    | group_by 10m, [value_request_count_aggregate: aggregate(value.request_count)]
    | every 10m
    | { group_by [metric.response_code_class],
          [response_code_count_aggregate: aggregate(value_request_count_aggregate)]
        | filter (metric.response_code_class = '5xx')
      ; group_by [],
          [value_request_count_aggregate_aggregate: aggregate(value_request_count_aggregate)] }
    | join
    | value [response_code_ratio: val(0) / val(1)]
    | condition gt(val(), 0.1)
    

    In this example, p13rr0m uses the request count for the service my-service.com. The query aggregates the request count over the last 10 minutes and keeps the responses with response code class 5xx. It also aggregates the request count over the same period across all response codes. The join and value steps then compute the ratio of 5xx responses to all responses. Finally, the condition step produces a boolean that is true when the ratio is above 0.1, which can be used to trigger an alert.
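
    The arithmetic behind the query can be sketched in plain Python to make the ratio and the threshold check concrete. This is only an illustration that runs outside Cloud Monitoring; it is not MQL and not a Google Cloud API. The function names error_ratio and priority_for are hypothetical, the P1/P2/P3 tiers are the 10%, 5% and 1% thresholds from the question, and returning 0.0 when there is no traffic is an assumption.

```python
from typing import Optional

# Thresholds taken from the question: >10% -> P1, >5% -> P2, >1% -> P3.
# Ordered from most to least severe so that the first match wins.
PRIORITY_THRESHOLDS = [(0.10, "P1"), (0.05, "P2"), (0.01, "P3")]


def error_ratio(count_5xx: int, count_total: int) -> float:
    """Mirror `val(0) / val(1)`: 5xx requests divided by all requests."""
    if count_total == 0:
        # Assumption: a window with no traffic is treated as having no errors.
        return 0.0
    return count_5xx / count_total


def priority_for(ratio: float) -> Optional[str]:
    """Return the most severe tier whose threshold the ratio strictly exceeds."""
    for threshold, label in PRIORITY_THRESHOLDS:
        if ratio > threshold:
            return label
    return None  # at or below 1%: no incident priority


if __name__ == "__main__":
    # Hypothetical counts for one 10-minute window.
    ratio = error_ratio(count_5xx=12, count_total=200)
    print(ratio, priority_for(ratio))
```

    The comparisons are strict, matching condition gt(val(), 0.1) in the query, so a ratio of exactly 0.10 falls into P2 rather than P1.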