I'm using Hystrix, Micrometer, and Prometheus.
The following query works, but I need to modify it and I don't know how:
sum by(group, key) (increase(hystrix_execution_total{event="exception_thrown"}[1m])) / sum by(group, key) (increase(hystrix_execution_terminal_total[1m])) * 100 >= 5
Basically, exception_thrown is one event emitted by Hystrix, but Hystrix also emits another event, bad_request, which signifies a client error (400). To get an accurate measure of upstream server errors, I need to subtract two series of the same metric, which have the same label names but different label values, and then get a rate per minute to alert on:
hystrix_execution_total{job="auth",key="authenticate",event="exception_thrown"} - hystrix_execution_total{job="auth",key="authenticate",event="bad_request"}
Using ignoring worked once someone pointed out that the two series differed in more than one label.
Not only was the event label value different, but there was a corresponding difference in the terminal label value: for exception_thrown, terminal=true; for bad_request, terminal=false.
sum by(group, key) (increase(hystrix_execution_total{job="auth",key="authenticate",event="exception_thrown"}[10m]) - ignoring(event, terminal) increase(hystrix_execution_total{job="auth",key="authenticate",event="bad_request"}[10m])) / sum by(group, key) (increase(hystrix_execution_terminal_total[10m])) * 100 >= 5
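Stripped of the aggregation and the percentage, the matching step on its own looks roughly like this. It is only a sketch, assuming the two series carry exactly the labels discussed above (job, key, event, terminal, plus group, which is left out of the comments for brevity). The two expressions are separate queries to run one at a time.

```promql
# Series on each side (label values as described above):
#   left:  {job="auth", key="authenticate", event="exception_thrown", terminal="true"}
#   right: {job="auth", key="authenticate", event="bad_request",      terminal="false"}

# Query 1: returns an empty result. By default a vector binary operation
# only pairs series whose label sets are identical, and these two differ
# in both event and terminal.
  increase(hystrix_execution_total{job="auth",key="authenticate",event="exception_thrown"}[10m])
-
  increase(hystrix_execution_total{job="auth",key="authenticate",event="bad_request"}[10m])

# Query 2: returns one series. Both differing labels are excluded from
# the match, so the remaining labels (job, key, ...) line up.
  increase(hystrix_execution_total{job="auth",key="authenticate",event="exception_thrown"}[10m])
- ignoring(event, terminal)
  increase(hystrix_execution_total{job="auth",key="authenticate",event="bad_request"}[10m])
```

The result of the second query keeps the non-ignored labels, which is why wrapping it in sum by(group, key) still works. One thing to watch for: a vector binary operation only returns elements that have a match on both sides, so a key that has never emitted a bad_request series would drop out of the subtraction entirely.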