I have metric data being pulled from Telegraf into Prometheus, and I have built a dashboard on top of the Prometheus metrics. I am trying to find a query that gives me the downtime percentage. The formula I use is: Downtime percentage = (number of seconds the status has been success / total number of seconds in a day) * 100
My metric data looks something like this:
Query: test_jobevent_status{logname="123_abc",instance="job123"}
Output: 0 (success) or 1 (failure)
So, downtime is the number of seconds for which test_jobevent_status is 1. Our scrape interval is 15s, so it is fine to treat the last scraped state as the state for every second within that 15-second window.
Could someone please help me write a query that returns the total number of seconds (or minutes) during which the job event's status was failing?
FWIW, summarize, sumSeries and group did the job in Graphite, but I am not sure what the equivalent is in Prometheus.
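Because test_jobevent_status is 0 on success and 1 on failure, avg_over_time over a one-day range returns the fraction of samples from the last day that were in the failure state. Multiplying that fraction by 100 gives the failure (downtime) percentage, and multiplying it by 86400 gives the approximate number of seconds spent failing, assuming scrapes are evenly spaced and none are missing. To get the percentage of time the status was success, which is what the formula in the question computes, try the following query: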
100-100*avg_over_time(test_jobevent_status{logname="123_abc",instance="job123"}[1d])
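To make the arithmetic behind the query concrete, here is a minimal Python sketch, not Prometheus code. It assumes one 0/1 sample per 15-second scrape with no gaps. The function name summarize and the sample day (one hour of failures) are hypothetical and only there for illustration.

```python
SCRAPE_INTERVAL_S = 15           # scrape interval stated in the question
SECONDS_PER_DAY = 24 * 60 * 60   # 86400


def summarize(samples):
    """Summarize a list of 0/1 samples (0 = success, 1 = failure).

    Each sample is assumed to stand for one full scrape interval.
    """
    if not samples:
        raise ValueError("need at least one sample")
    # Mean of the 0/1 values: the quantity avg_over_time computes over the range.
    failure_fraction = sum(samples) / len(samples)
    return {
        "failure_pct": 100 * failure_fraction,
        "success_pct": 100 - 100 * failure_fraction,  # same shape as the query above
        "failure_seconds": sum(samples) * SCRAPE_INTERVAL_S,
    }


# Hypothetical day: 5760 scrapes (86400 / 15), of which 240 were failing.
scrapes_per_day = SECONDS_PER_DAY // SCRAPE_INTERVAL_S
samples = [1] * 240 + [0] * (scrapes_per_day - 240)
result = summarize(samples)
print(result)
```

With these made-up samples, 240 failing scrapes correspond to one hour of failure, so the failure and success percentages add up to 100 and the failing time is 240 * 15 seconds.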