google-cloud-platform, cloud, google-cloud-monitoring

Negative error budget even when Service Level Indicator (SLI) is greater than Service Level Objective (SLO)


I created a request-based SLI where the good service filter is CPU usage time for a specific VM instance ("instance-1" here) and the total service filter is CPU usage for all VM instances. I set the SLO goal to 50%.

I thought that since the SLI is greater than the SLO, the error budget should be positive, but I got a negative error budget. What does this mean?

The service I used here is a custom service. The graph shows the good/total ratio of CPU utilization for the instance "instance-1".



Solution

  • A service's error budget is the number of failures (errors or other bad events) that the service is allowed to experience over a given period of time. The error budget is calculated from the Service Level Objective (SLO). For instance, if your SLO is 99.9% uptime, your error budget is the remaining 0.1%. This is the allowable margin of error: the amount of downtime or errors that is tolerable within a specific period. You can find more information in this document.

    The negative error budget of -84.97% indicates that the service has consumed more than its allotted error budget. This can occur if the service's reliability drops significantly below the agreed Service Level Objective (SLO); refer to this document for relevant information. However, since your Service Level Indicator (SLI) appears to be greater than the SLO, this needs further investigation. You can create an alert to monitor when the SLO has been breached. If you don’t get any alerts, you can file a report on the Public Issue Tracker with a description of your issue. The Issue Tracker is a forum for end users to report bugs and request features to improve Google Cloud products.

    You can also look at the document for more details.
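    The arithmetic behind the answer above can be sketched as follows. This is a minimal illustration that assumes the conventional definition of remaining error budget (the allowed bad fraction is 1 - SLO, and the remaining budget is the unspent share of it over the whole compliance period); it is not confirmed here that Cloud Monitoring computes the displayed value in exactly this way. The function names `error_budget_remaining` and `implied_sli` are hypothetical helpers, not part of any Google Cloud API.

```python
def error_budget_remaining(good: float, total: float, slo_goal: float) -> float:
    """Fraction of the error budget still unspent over the compliance period.

    Assumed (conventional) definition, not taken from the Cloud Monitoring source:
        remaining = 1 - (1 - SLI) / (1 - SLO),  where SLI = good / total
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= slo_goal < 1:
        raise ValueError("slo_goal must be in [0, 1)")
    sli = good / total
    allowed_bad = 1.0 - slo_goal   # the error budget, as a fraction of total
    actual_bad = 1.0 - sli         # the fraction of bad events observed
    return 1.0 - actual_bad / allowed_bad


def implied_sli(remaining: float, slo_goal: float) -> float:
    """Invert the formula above: the period-wide SLI implied by a remaining budget."""
    return slo_goal + remaining * (1.0 - slo_goal)


if __name__ == "__main__":
    # With a 50% goal, an SLI of 75% leaves half of the budget unspent.
    print(f"{error_budget_remaining(75, 100, 0.5):.2%}")
    # Under this definition, a remaining budget of -84.97% at a 50% goal
    # corresponds to a period-wide SLI of about 7.5%.
    print(f"{implied_sli(-0.8497, 0.5):.2%}")
```

    Under this assumed definition, the remaining budget can only be negative when the SLI measured over the whole compliance period is below the SLO. One thing worth checking, therefore, is whether the time window shown in the graph matches the SLO's compliance period.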