prometheus, grafana, promql

Graph Grafana Prometheus Un-Reset Counter Metrics


I have a Prometheus counter metric that represents a uniquely-named job status.

Unfortunately, the metric is not reset after each report. Because every job has a unique name, Prometheus keeps a separate time series with the value 1 for each job for as long as that job record exists.

I am trying to get the number of failed jobs (status="Failed") over a specified period (the last 24 hours) using this PromQL query in Grafana:

sum (status_metric{status="Failed"}) by(status)

However, because the metric was already 1 before the 24-hour window and is never reset to 0 after the first report, jobs that reported their status before the specified range are also included in the sum, which is not intended.

Question: How can I count only the status events that started within a specific range and exclude values from older events?

Example: In the image below, job_1 and job_2 finished execution and reported their status on Jan 1st, but they are still reporting the same status now (Jan 2nd 22:00:00). job_3, however, finished and reported its status on Jan 2nd at 5:00. The goal is to get the number of failed jobs from now-24h (Jan 1st 22:00) to now (Jan 2nd 22:00), which should be 1, not 3, assuming that only one job failed on Jan 2nd.

[Image: status_metric over 2 days]

Thanks


Solution

  • The issue was fixed by upgrading Grafana to the latest version at the time (9.3.6), which fixes the option for setting missing values to 0.
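
As a PromQL-only alternative that does not depend on the Grafana version, a query along the following lines may give the count the question asks for. This is a minimal sketch, not the fix described above. It assumes that each job's series carries a unique label set (for example a job-name label) and that a series which already existed 24 hours ago belongs to an older report. It takes the failed series present now, uses `unless` with `offset 24h` to drop those that were already present 24 hours earlier, and counts what remains. The `or vector(0)` part is only there to return 0 instead of an empty result.

```promql
# Count failed-job series that exist now but did not exist 24 hours ago.
# Assumption: one series per job, distinguished by its label set.
count(
    status_metric{status="Failed"}
  unless
    status_metric{status="Failed"} offset 24h
)
or vector(0)
```

With the example from the question, job_1 and job_2 would be dropped if their series were already present at Jan 1st 22:00, leaving only job_3. A job whose series happened to be absent exactly 24 hours ago would be counted as new, so the result should be treated as an approximation.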