
Grafana gauge for percentage error rate not showing correct calculation


I am trying to visualize the percentage of requests that resulted in an error (minute by minute) with a Grafana gauge, but the gauge is not showing the correct value. For example, when I execute 10 requests within a 1-minute interval, where 5 of those requests result in HTTP 200 and 5 result in HTTP 500, I expect the gauge to show a 50% error percentage. However, the value stays at 100% even though I have been sending both successful and unsuccessful requests to the API.

This is the corresponding query:

100 * (sum(sum_over_time(total_requests_gauge{status_code!="200"}[1m]))/ on() group_left() sum(sum_over_time(total_requests_gauge[1m])))

I have configured the gauge unit to Percent.

On the client side, this is how I have set up the Prometheus exporter:

MetricsReporter.cs

using System;
using Microsoft.Extensions.Logging;
using Prometheus;

public class MetricReporter
{
    private readonly ILogger<MetricReporter> _logger;
    private readonly Counter _requestCounter;
    private readonly Gauge _requestGauge;
    private readonly Histogram _responseTimeHistogram;

    public MetricReporter(ILogger<MetricReporter> logger)
    {
        _logger = logger ?? throw new ArgumentNullException(nameof(logger));

        // Neither the counter nor the gauge declares any label names.
        _requestCounter = Metrics.CreateCounter("total_requests", "The total number of requests serviced by this API.");
        _requestGauge = Metrics.CreateGauge("total_requests_gauge", "The total number of requests serviced by this API.");

        // Only the histogram carries the status_code, method and path labels.
        _responseTimeHistogram = Metrics.CreateHistogram("request_duration_seconds",
            "The duration in seconds between the response to a request.", new HistogramConfiguration
            {
                Buckets = Histogram.ExponentialBuckets(0.01, 2, 10),
                LabelNames = new[] { "status_code", "method", "path" }
            });
    }

    // Called once per request; increments both the counter and the gauge.
    public void RegisterRequest()
    {
        _requestCounter.Inc();
        _requestGauge.Inc();
    }

    // Records the request duration in seconds under the given label values.
    public void RegisterResponseTime(int statusCode, string method, string path, TimeSpan elapsed)
    {
        _responseTimeHistogram.Labels(statusCode.ToString(), method, path).Observe(elapsed.TotalSeconds);
    }
}

Prometheus is scraping the metrics correctly. It is available at http://localhost:9090, and the API exposes its metrics at http://localhost:80/metrics.

I also have an endpoint that always returns an error response:

[AllowAnonymous]
[HttpPost("problem")]
public IActionResult Problem([FromBody] RegisterModel model)
{
    // Always returns an HTTP 500 error.
    return Problem();
}

What am I missing?


Solution

  • sum_over_time() is the wrong Prometheus function for this use case. I would use increase() instead, which makes the calculation easier.

    increase() calculates how much a counter increased over the specified interval, whereas sum_over_time() calculates the sum of all sample values in that interval.
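
To make the difference between the two functions concrete, here is a small sketch in plain C# that applies both ideas to the same window of samples. The sample values and the names RangeFunctionsDemo, SumOverTime and SimpleIncrease are made up for illustration. SimpleIncrease is a deliberate simplification: the real increase() in Prometheus also extrapolates to the window boundaries and handles counter resets.

```csharp
using System;
using System.Linq;

public static class RangeFunctionsDemo
{
    // Adds up every sample in the window, which is the idea behind sum_over_time().
    public static double SumOverTime(double[] samples) => samples.Sum();

    // Simplified increase(): last sample minus first sample in the window.
    // Assumption: no counter resets and no extrapolation, unlike real Prometheus.
    public static double SimpleIncrease(double[] samples) =>
        samples.Length < 2 ? 0 : samples[^1] - samples[0];

    public static void Main()
    {
        // Hypothetical scrapes of an ever-growing request count, one every 15 seconds.
        double[] samples = { 100, 102, 105, 110 };

        Console.WriteLine($"sum_over_time: {SumOverTime(samples)}");            // 417
        Console.WriteLine($"increase (simplified): {SimpleIncrease(samples)}"); // 10
    }
}
```

With a value that only ever grows, the sum of samples (417 here) mostly reflects how large the total already is, while the increase (10 here) is the number of requests that actually arrived in the window.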

    Here is a query that I tested and that worked fine for me:

    sum(increase(http_server_requests_seconds_count{namespace="",pod_name=~"",uri=~"",status!="200"}[1m])) / sum(increase(http_server_requests_seconds_count{namespace="",pod_name=~"",uri=~""}[1m])) * 100

    Looks like you are using a custom metric, so change the metric name and filter params accordingly.
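
One more thing that may explain the constant 100%: in the question's code, total_requests_gauge is created without a status_code label. In PromQL, a matcher such as status_code!="200" also selects series that do not have that label at all, so the numerator and the denominator end up selecting the same series. A minimal sketch of a counter that does carry the label, assuming prometheus-net as used in the question, could look like this. The class name LabeledMetricReporter, the metric name api_requests_total and the RegisterRequest(int statusCode) signature are hypothetical.

```csharp
using Prometheus;

public class LabeledMetricReporter
{
    private readonly Counter _requestCounter;

    public LabeledMetricReporter()
    {
        // Hypothetical metric name; the status_code label is what lets a query
        // separate failed requests from successful ones.
        _requestCounter = Metrics.CreateCounter(
            "api_requests_total",
            "The total number of requests serviced by this API, by status code.",
            new CounterConfiguration
            {
                LabelNames = new[] { "status_code" }
            });
    }

    // Call once per completed request, passing the response status code.
    public void RegisterRequest(int statusCode)
    {
        _requestCounter.Labels(statusCode.ToString()).Inc();
    }
}
```

With a metric shaped like this, the query from the answer would become something along the lines of 100 * sum(increase(api_requests_total{status_code!="200"}[1m])) / sum(increase(api_requests_total[1m])). Note that status_code!="200" counts every other status as an error, so a narrower matcher may fit better if the API also returns codes such as 201 or 304.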