The task is to implement an SLI for application/service availability over 5 minutes using Prometheus, but the catch is that I cannot use the Prometheus up metric. Instead I have to use the Spring health (UP/DOWN) status to track availability. Can anyone please help here? Thanks in advance.
The availability result should be 0/1: 0 = not available, 1 = available.
To expose the health status as a metric to Prometheus I tried the code below. Or is there a better way to calculate availability from Spring health metrics? Note: I cannot use any prober, as that is restricted by the business.
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.health.HealthEndpoint;
import org.springframework.boot.actuate.health.Status;
import org.springframework.context.annotation.Configuration;

@Configuration(proxyBeanMethods = false)
public class MyHealthMetricsExportConfiguration {

    public MyHealthMetricsExportConfiguration(MeterRegistry registry, HealthEndpoint healthEndpoint) {
        // This example presumes common tags (such as the app) are applied elsewhere.
        // Maps the aggregated Spring health status to a gauge: UP -> 1, anything else -> 0.
        Gauge.builder("health", healthEndpoint, this::getStatusCode).strongReference(true).register(registry);
    }

    private int getStatusCode(HealthEndpoint health) {
        Status status = health.health().getStatus();
        if (Status.UP.equals(status)) {
            return 1;
        }
        return 0;
    }

}
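For the application label used in the queries below, I'm assuming the common tag is set through Spring Boot's Micrometer configuration; a minimal sketch in application.yml could look like this (the tag value "My Service" is taken from the queries in this post, the property itself is the standard Micrometer common-tags setting):

management:
  metrics:
    tags:
      # applied to every exported meter, so the gauge is scraped as
      # health{application="My Service", ...}
      application: "My Service"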
Now it shows the health gauge value = 1, as the localhost service is up and running. I have 2 instances running locally.
I tried the below PromQL query:
sum by (application) (avg_over_time(health{application="My Service"}[5m]))
Since both of my instances are running, it gives 2 as the value, but I need the result to treat the application as a whole, i.e. if both instances are running the result should be 1, and if one of the instances is down the result should still be 1.
And if both instances are down, the query should return 0 as the result.
How do I achieve this? Any help will be appreciated, thanks in advance.
Well, there are a few options. I'll assume the health metric you're using behaves like the up metric that is usually present with the common Prometheus exporters. To the best of my understanding, the first option below is the better one for your use case.
The first option fits the case where both instances might be restarting repeatedly, for example due to bad data or configuration: even if at any given moment one of them is up, you would still want to know about such behaviour. In that case calculating the average over time is better, but you want to take the max of the two instances to verify that at least one of them was running correctly the whole time:

max by (application) (avg_over_time(health{application="My Service"}[5m]))
The complete rule example:

groups:
  - name: MyServiceRuleGroup
    rules:
      - alert: ServiceUptimeIsCompromised
        expr: max by (application) (avg_over_time(health{application="My Service"}[5m])) < 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: service {{ $labels.application }} uptime has fallen below the threshold
The reason to use max over avg is to figure out whether at least one of the instances was up entirely on its own, which is presumably what we want to check. If we wanted to know the general state of the application we would use avg (theoretically only; it is not the best option for that need). For example, if one instance was up for the whole window (avg_over_time = 1) and the other was down the whole time (avg_over_time = 0), max gives 1 while avg gives 0.5. If you want to catch more immediate fluctuations, you can reduce the time span that is averaged over in the query itself, as in the sketch below.
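For instance, a shorter-window variant of the same query (the 1m window here is just an illustrative choice, not something from the original question):

max by (application) (avg_over_time(health{application="My Service"}[1m]))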
The second option makes more sense when the app doesn't have an auto-heal solution in place, for example when a process or server crashes and needs to be started up manually. In that case you can check:

sum by (application) (health{application="My Service"}) < 1
The complete rule example:

groups:
  - name: MyServiceRuleGroup
    rules:
      - alert: ServiceUptimeIsCompromised
        expr: sum by (application) (health{application="My Service"}) < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: service {{ $labels.application }} uptime has fallen below the threshold
If you wish to be informed as soon as the service goes down, you can change the groups[i].rules[j].for field to 1m. Or you can keep it at 5m if you only want to be informed of prolonged downtime caused by an error or some other issue that cannot heal on its own.
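If, beyond alerting, you also need the 0/1 availability value itself as a recorded SLI series, a sketch along these lines might work; the rule name application:availability:5m and the definition "at least one instance fully up over the 5-minute window" are assumptions on my part, not something stated in the question:

groups:
  - name: MyServiceSLIRecordingRules
    rules:
      # The bool modifier turns the comparison into a 0/1 value instead of
      # filtering: 1 if at least one instance was up for the whole 5-minute
      # window, 0 otherwise.
      - record: application:availability:5m
        expr: max by (application) (avg_over_time(health{application="My Service"}[5m])) >= bool 1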