The task is to implement an SLI for application/service availability over 5 minutes using Prometheus, but the catch is that I cannot use the Prometheus up metric. Instead I have to use the Spring health (UP/DOWN) status to track availability. Can anyone please help here? Thanks in advance.
The availability result should be 0/1: 0 = not available, 1 = available.
To expose the health status as a metric to Prometheus I tried the code below. Or is there a better way to calculate availability from Spring health metrics? Note: I cannot use any prober, as that is restricted by the business.
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.health.HealthEndpoint;
import org.springframework.boot.actuate.health.Status;
import org.springframework.context.annotation.Configuration;

@Configuration(proxyBeanMethods = false)
public class MyHealthMetricsExportConfiguration {

    public MyHealthMetricsExportConfiguration(MeterRegistry registry, HealthEndpoint healthEndpoint) {
        // This example presumes common tags (such as the app) are applied elsewhere.
        // Maps the aggregated Spring health status to a gauge: UP -> 1, anything else -> 0.
        Gauge.builder("health", healthEndpoint, this::getStatusCode).strongReference(true).register(registry);
    }

    private int getStatusCode(HealthEndpoint health) {
        Status status = health.health().getStatus();
        if (Status.UP.equals(status)) {
            return 1;
        }
        return 0;
    }

}
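For the application label used in the queries below, I'm assuming the common tag is set through Spring Boot's Micrometer configuration; a minimal sketch in application.yml could look like this (the tag value "My Service" is taken from the queries in this post, the property itself is the standard Micrometer common-tags setting):

management:
  metrics:
    tags:
      # applied to every exported meter, so the gauge is scraped as
      # health{application="My Service", ...}
      application: "My Service"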
Now it shows the health gauge value = 1, as the localhost service is up and running. I have 2 instances running locally.
I tried the below PromQL query:
sum by (application) (avg_over_time(health{application="My Service"}[5m]))
Since both of my instances are running, it gives 2 as the value, but I need the result to treat the application as a whole, i.e. if both instances are running the result should be 1, and if one of the instances is down the result should still be 1.
And if both instances are down, the query should return 0 as the result.
How do I achieve this? Any help will be appreciated, thanks in advance.
Well, there are a few options. I'll assume the health metric you're using behaves like the up metric that is usually present with the common Prometheus exporters. To the best of my understanding, the first option below is the better one for your use case.
The first option fits the case where both instances might be restarting repeatedly, for example due to bad data or configuration: even if at any given moment one of them is up, you would still want to know about such behaviour. In that case calculating the average over time is better, but you want to take the max of the two instances to verify that at least one of them was running correctly the whole time:

max by (application) (avg_over_time(health{application="My Service"}[5m]))
The complete rule example:

groups:
  - name: MyServiceRuleGroup
    rules:
      - alert: ServiceUptimeIsCompromised
        expr: max by (application) (avg_over_time(health{application="My Service"}[5m])) < 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: service {{ $labels.application }} uptime has fallen below the threshold
The reason to use max over avg is to figure out whether at least one of the instances was up entirely on its own, which is presumably what we want to check. If we wanted to know the general state of the application we would use avg (theoretically only; it is not the best option for that need). For example, if one instance was up for the whole window (avg_over_time = 1) and the other was down the whole time (avg_over_time = 0), max gives 1 while avg gives 0.5. If you want to catch more immediate fluctuations, you can reduce the time span that is averaged over in the query itself, as in the sketch below.
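For instance, a shorter-window variant of the same query (the 1m window here is just an illustrative choice, not something from the original question):

max by (application) (avg_over_time(health{application="My Service"}[1m]))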
The second option makes more sense when the app doesn't have an auto-heal solution in place, for example when a process or server crashes and needs to be started up manually. In that case you can check:

sum by (application) (health{application="My Service"}) < 1
The complete rule example:

groups:
  - name: MyServiceRuleGroup
    rules:
      - alert: ServiceUptimeIsCompromised
        expr: sum by (application) (health{application="My Service"}) < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: service {{ $labels.application }} uptime has fallen below the threshold
If you wish to be informed as soon as the service goes down, you can change the groups[i].rules[j].for field to 1m. Or you can keep it at 5m if you only want to be informed of prolonged downtime caused by an error or some other issue that cannot heal on its own.
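If, beyond alerting, you also need the 0/1 availability value itself as a recorded SLI series, a sketch along these lines might work; the rule name application:availability:5m and the definition "at least one instance fully up over the 5-minute window" are assumptions on my part, not something stated in the question:

groups:
  - name: MyServiceSLIRecordingRules
    rules:
      # The bool modifier turns the comparison into a 0/1 value instead of
      # filtering: 1 if at least one instance was up for the whole 5-minute
      # window, 0 otherwise.
      - record: application:availability:5m
        expr: max by (application) (avg_over_time(health{application="My Service"}[5m])) >= bool 1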