I am using Spring Batch (4.2.2.RELEASE) together with Spring Boot Actuator (2.2.6.RELEASE). Since version 4.2, Spring Batch supports batch monitoring and metrics based on Micrometer (https://docs.spring.io/spring-batch/docs/4.2.x/reference/html/monitoring-and-metrics.html).
For example, with the metric name spring_batch_job I can see how often a job was executed, along with its status and duration.
I want to monitor this metric with Grafana and Prometheus and raise an alert if a job failed in the last xx minutes.
If the Spring Batch application runs as a long-lived service, the metrics seem to accumulate until the service is stopped. For example, if a job was started 12 times in the last hour, the metrics output could look like this:
spring_batch_job_seconds_count{name="mainJob",status="COMPLETED",} 10.0
spring_batch_job_seconds_sum{name="mainJob",status="COMPLETED",} 354.354538083
spring_batch_job_seconds_count{name="mainJob",status="FAILED",} 2.0
spring_batch_job_seconds_sum{name="mainJob",status="FAILED",} 0.880157862
So two instances of mainJob failed. Assuming that all 12 jobs in the next hour are successful, the metrics output would be:
spring_batch_job_seconds_count{name="mainJob",status="COMPLETED",} 22.0
spring_batch_job_seconds_sum{name="mainJob",status="COMPLETED",} 708.704538083
spring_batch_job_seconds_count{name="mainJob",status="FAILED",} 2.0
spring_batch_job_seconds_sum{name="mainJob",status="FAILED",} 0.880157862
How can I check whether a job failed in the last xx minutes? The following expression would still return the two failed job instances, because the counter never goes back down: spring_batch_job_seconds_count{status="FAILED"}[15m]
I'm not familiar with PromQL, but I will try to help.
What you can do is calculate how much this counter has grown over the time window you care about, for example the difference between its value now and its value one hour ago. If the number of failed instances has increased, then at least one instance failed in that window and you can raise an alert. Otherwise, no job failed during that window.
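To make the idea concrete, here is a rough Python sketch of that difference calculation. It is only a simplified model of the counter arithmetic, not how Prometheus implements it: the function name, the 30-minute scrape interval and the intermediate sample values are made up for illustration, and only the totals (2 failures after the first hour, still 2 after the second) come from the question.

```python
from typing import List, Tuple

# (unix timestamp in seconds, counter value) -- hypothetical scrape results
Sample = Tuple[float, float]


def counter_increase(samples: List[Sample], window_start: float, window_end: float) -> float:
    """Growth of an ever-increasing counter inside [window_start, window_end].

    Simplified on purpose: Prometheus' increase() additionally handles counter
    resets and extrapolates to the window edges, so its results can be
    non-integer. This sketch does neither.
    """
    in_window = [value for ts, value in samples if window_start <= ts <= window_end]
    if len(in_window) < 2:
        # Not enough data points to compute a difference.
        return 0.0
    return in_window[-1] - in_window[0]


# Assumed scrapes of spring_batch_job_seconds_count{status="FAILED"}:
# two failures during the first hour, none during the second.
failed_samples = [(0, 0.0), (1800, 1.0), (3600, 2.0), (5400, 2.0), (7200, 2.0)]

first_hour = counter_increase(failed_samples, 0, 3600)
second_hour = counter_increase(failed_samples, 3600, 7200)

print(first_hour)   # 2.0 -> something failed, raise an alert
print(second_hour)  # 0.0 -> counter is still 2.0, but nothing new failed
```

The raw counter value stays at 2.0 in the second hour, which is why querying it directly keeps reporting the old failures. Only the growth within the window tells you whether something failed recently.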
Prometheus provides the increase function, which is designed specifically for this. So you should be able to raise an alert when:
increase(spring_batch_job_seconds_count{name="mainJob",status="FAILED"}[15m]) > 0
As I said, I'm no expert in Prometheus, so I will let you check the syntax, but that's the idea.
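If you want to try the expression outside of Grafana, a small script against the Prometheus HTTP API (the /api/v1/query endpoint) might look roughly like the sketch below. The server address http://localhost:9090, the helper names and the sample response are assumptions for illustration. The sample body is hand-written in the shape of an instant-query response, not real output.

```python
import json
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # assumption: a local Prometheus server
EXPR = 'increase(spring_batch_job_seconds_count{name="mainJob",status="FAILED"}[15m]) > 0'


def build_query_url(base_url: str, expr: str) -> str:
    """URL for an instant query; the expression is passed as the 'query' parameter."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def failed_series(response_body: str):
    """Return (labels, value) pairs from an instant-query JSON response.

    Because the expression ends in '> 0', only series with recent failures
    are returned, so an empty list means nothing failed in the window.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return [(r["metric"], float(r["value"][1])) for r in payload["data"]["result"]]


# Hand-written example responses (illustrative, not captured from a server).
sample_with_failure = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"name": "mainJob", "status": "FAILED"},
             "value": [1590000000, "1.0"]},
        ],
    },
})
sample_without_failure = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": []},
})

print(build_query_url(PROMETHEUS_URL, EXPR))
print(failed_series(sample_with_failure))
print(failed_series(sample_without_failure))
```

To run it against a real server you would fetch the URL from build_query_url (for example with urllib.request.urlopen) and pass the response body to failed_series. One thing to watch for: if the FAILED series does not exist at all before the first failure, increase() has no earlier sample to compare against, so the very first failure may not show up.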