monitoring prometheus promql kubernetes-cronjob

Prometheus query for last local peak value

What Prometheus query (PromQl) can be used to identify the last local peak value in the last X minutes in a graph?

A local peak is a point that is larger than its previous and next datapoint. (So the current time is definitely not a local peak)

(p: peak point, i: cornjob interval, m: missed execuation)

I want this value to find an anomaly in the execution of a cron job. As you can see in the picture, I have written a query to calculate the elapsed time since the last execution of a job. Now to set an alert rule to calculate the elapsed time from the last successful execution and find missed execution, I need the amount of time that the last execution of the job occurred in that interval. This interval is unknown for the query (In other words, the interval of the job is specified by another program), so I can not compare elapsed time with a fixed time.

Solution

Use z-score to detecting anomalies

If you know the average value and standard deviation (σ) of a series, you can use any sample in the series to calculate the z-score. The z-score is measured in the number of standard deviations from the mean. So a z-score of 0 would mean the z-score is identical to the mean in a data set with a normal distribution, while a z-score of 1 is 1.0 σ from the mean, etc.

Calculate the average and standard deviation for the metric using data with large sample size.

# Long-term average value for the series
- record: job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
expr: avg_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])

# Long-term standard deviation for the series
- record: job:cronjob_duration_time_seconds_count:rate5m:stddev_over_time_1w
expr: stddev_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])

calculate the z-score for the Prometheus query once you have the average and standard deviation for the aggregation.

# Z-Score for aggregation
(
job:cronjob_duration_time_seconds_count:rate10m -
job:cronjob_duration_time_seconds_count:rate10m:avg_over_time_1w
) /  stddev_over_time(sum(rate(cronjob_duration_time_seconds_count[10m]))[1w:])

Based on the statistical principles of normal distributions, you can assume that any value that falls outside of the range of roughly +1 to -1 is an anomaly. For example, you can get an alert when our aggregation is out of this range for more than five minutes.