I am trying to visualize the percentage of requests that resulted in an error (minute by minute) in a Grafana gauge, but the gauge is not showing the correct value. For example, when I execute 10 requests within a 1-minute interval, where 5 of those requests result in HTTP 200 and 5 result in HTTP 500, I expect the gauge to show a 50% error percentage. However, the value stays at 100% even though I have been sending both successful and unsuccessful requests to the API.
This is the corresponding query:
100 * (sum(sum_over_time(total_requests_gauge{status_code!="200"}[1m]))/ on() group_left() sum(sum_over_time(total_requests_gauge[1m])))
I have configured the gauge unit to Percent.
On the client side, this is how I have set up the Prometheus exporter:
MetricsReporter.cs
public class MetricReporter
{
    private readonly ILogger<MetricReporter> _logger;
    private readonly Counter _requestCounter;
    private readonly Gauge _requestGauge;
    private readonly Histogram _responseTimeHistogram;

    public MetricReporter(ILogger<MetricReporter> logger)
    {
        _logger = logger ?? throw new ArgumentNullException(nameof(logger));
        _requestCounter = Metrics.CreateCounter("total_requests", "The total number of requests serviced by this API.");
        _requestGauge = Metrics.CreateGauge("total_requests_gauge", "The total number of requests serviced by this API.");
        _responseTimeHistogram = Metrics.CreateHistogram("request_duration_seconds",
            "The duration in seconds between the response to a request.", new HistogramConfiguration
            {
                Buckets = Histogram.ExponentialBuckets(0.01, 2, 10),
                LabelNames = new[] { "status_code", "method", "path" }
            });
    }

    public void RegisterRequest()
    {
        _requestCounter.Inc();
        _requestGauge.Inc();
    }

    public void RegisterResponseTime(int statusCode, string method, string path, TimeSpan elapsed)
    {
        _responseTimeHistogram.Labels(statusCode.ToString(), method, path).Observe(elapsed.TotalSeconds);
    }
}
Prometheus is scraping the metrics correctly at http://localhost:9090, as well as the API endpoint at http://localhost:80/metrics.
I also have an endpoint that always returns error responses:
[AllowAnonymous]
[HttpPost("problem")]
public IActionResult Problem([FromBody] RegisterModel model)
{
    // Always returns an HTTP 500 error
    return Problem();
}
What am I missing?
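You are using the wrong Prometheus function for this use case: sum_over_time(). I would use increase() instead, which makes the calculation straightforward.
increase() calculates how much a counter increased over the specified interval, whereas sum_over_time() adds up all the sample values in the specified interval. Because your metric is incremented on every request, its samples are cumulative totals, and summing them does not give the number of requests made in that interval.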
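To make the difference concrete, here is a small Python sketch that mimics both functions on one 1-minute window. It is only an illustration: the sample values and the 15-second scrape interval are made up, and simple_increase() is a hypothetical helper that ignores the extrapolation to the window edges that the real increase() performs.

```python
# Simplified model of two PromQL range functions over a single 1-minute window.
# Assumption: one series scraped every 15 s; all sample values are invented.

def sum_over_time(samples):
    """Add up every sample value in the window, like PromQL's sum_over_time()."""
    return sum(samples)

def simple_increase(samples):
    """Growth of a counter across the window, tolerating resets to zero.

    Unlike PromQL's increase(), this does not extrapolate to the window edges.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter was reset, so count the new value from zero.
        total += cur - prev if cur >= prev else cur
    return total

# Cumulative request totals seen at four scrapes within one minute:
# 10 new requests overall, 5 of them errors.
all_requests = [100, 103, 107, 110]
error_requests = [40, 42, 44, 45]

wrong = 100 * sum_over_time(error_requests) / sum_over_time(all_requests)
right = 100 * simple_increase(error_requests) / simple_increase(all_requests)

print(f"sum_over_time ratio: {wrong:.1f}%")  # about 40.7%, depends on history
print(f"increase ratio:      {right:.1f}%")  # 50.0%, the rate within the window
```

The first ratio is driven by the totals accumulated since the process started, while the second reflects only the requests that arrived during the window.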
Here is the query that I tested and worked fine for me:
sum(increase(http_server_requests_seconds_count{namespace="",pod_name=~"",uri=~"",status!="200"}[1m]))/sum(increase(http_server_requests_seconds_count{namespace="",pod_name=~"",uri=~""}[1m])) *100
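It looks like you are using a custom metric, so change the metric name and label filters accordingly. Note that total_requests_gauge is created without a status_code label, so the matcher status_code!="200" selects every series and the ratio is always 100%. The request_duration_seconds histogram does carry that label, so its request_duration_seconds_count series is one candidate to use with increase().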