Tags: prometheus, prometheus-alertmanager, prometheus-pushgateway

Prometheus alerts based on non-periodic metrics from batch jobs


I have a CronJob that runs every 20 minutes, collects only the active builds of Jenkins multibranch jobs whose build duration has exceeded a certain threshold, and publishes these metrics, with the build duration as the value, to the Prometheus Pushgateway. The metric is therefore not pushed once a build has completed. I have alerts set up using the Prometheus configuration below.

- alert: BuildDurationExceeded
  expr: jenkins_build_duration > 7200
  annotations:
    title: 'Build duration is too long at {{ $labels.instance }}'
    description: 'Build time of job {{ $labels.job }} on {{ $labels.instance }} exceeded 2h.'
  labels:
    severity: 'high'

Below is one metric value that is present in Prometheus:

jenkins_build_duration{branch="repo/branch_name",build_number="5",instance="https://jenkins-instance.com/",jenkins_url="https://jenkins-instance.com/job/repo/job/branch_name/5/",job="jenkins_metrics_job",job_name="repo"}    10000
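
For reference, here is roughly how such a metric could be pushed from the CronJob with the Prometheus Python client (a minimal sketch; the Pushgateway address and the split between metric labels and grouping-key labels are assumptions, not details from the original setup):

# Sketch of the push step; gateway address and label layout are assumptions.
from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

registry = CollectorRegistry()
duration = Gauge(
    'jenkins_build_duration',
    'Duration of an active Jenkins build in seconds',
    ['branch', 'jenkins_url', 'job_name'],
    registry=registry,
)
duration.labels(
    branch='repo/branch_name',
    jenkins_url='https://jenkins-instance.com/job/repo/job/branch_name/5/',
    job_name='repo',
).set(10000)

# Keeping the build number in the grouping key gives each build its own
# metric group on the Pushgateway, so it can later be deleted individually.
pushadd_to_gateway(
    'https://pushgateway.example.com',   # assumed Pushgateway address
    job='jenkins_metrics_job',
    registry=registry,
    grouping_key={'instance': 'https://jenkins-instance.com/', 'build_number': '5'},
)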

With this configuration, once an alert fires it is retained forever and never resolves. How can I handle this use case, where the metric is not periodic for a given job, and clear the alert when the metric is no longer available?


Solution

  • We fixed this by explicitly deleting the metric from the Pushgateway once it is no longer relevant, using the delete_from_gateway API of the Prometheus Python client library (see the sketch below).
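
A minimal sketch of that delete call, assuming the same job name and grouping key that were used when pushing (the Pushgateway address is an assumption):

# Sketch: delete the metric group for a finished build so its series
# disappears from the Pushgateway and the alert can resolve.
from prometheus_client import delete_from_gateway

delete_from_gateway(
    'https://pushgateway.example.com',   # assumed Pushgateway address
    job='jenkins_metrics_job',
    grouping_key={'instance': 'https://jenkins-instance.com/', 'build_number': '5'},
)

Once the group is deleted, Prometheus no longer scrapes that series from the Pushgateway, the expression returns no result, and the BuildDurationExceeded alert resolves on a subsequent evaluation.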