I have a blackbox exporter that checks some HTTP endpoints. I've noticed that it (rightly) doesn't use histograms, so I was wondering what the best way is to calculate SLAs for each endpoint.
For instance, let's say I check http://google.com. I'd like to calculate:
- the percentage of times I received a valid response (probe_success)
- the percentage of times the response was fetched within X milliseconds
I've tried using avg_over_time:
avg_over_time(probe_success{target="https://google.com"}[30d])
and dividing by the count of the same metric, but I know it's wrong and something's missing.
avg_over_time(probe_success[1d])
will give you a ratio between 0 (0%) and 1 (100%). There is no need to divide by the count, since the average already does that. If you want a percentage, multiply by 100, or set the unit in Grafana (I believe it's called "percent (0.0-1.0)" or something like that).
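Applied to the query from the question, that could look something like the sketch below. It assumes the `target` label and the 30-day window from the question; adjust both to match your own relabeling and SLA period.

```promql
# Availability over the last 30 days, as a percentage (0-100).
# probe_success is 1 for a successful probe and 0 for a failed one,
# so its average over time is already the success ratio.
avg_over_time(probe_success{target="https://google.com"}[30d]) * 100
```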
If, on the other hand, you want a percentile for some metric, say 90th-percentile memory utilization, you'd use something like quantile_over_time(0.9, memory_utilization[1d]).
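For the second part of the question (the share of probes answered within X milliseconds), one possible approach is to turn each duration sample into 0 or 1 with a `bool` comparison and average that over a subquery. This is only a sketch: the 0.5 second threshold, the `1m` subquery resolution, and the `target` label are assumptions, and subqueries need Prometheus 2.7 or newer.

```promql
# Percentage of samples over the last 30 days where the probe took
# less than 0.5 s (threshold is an assumed stand-in for "X milliseconds").
# "< bool" yields 1 when the comparison holds and 0 otherwise.
# The 1m resolution is assumed; it should roughly match your scrape interval.
avg_over_time(
  (probe_duration_seconds{target="https://google.com"} < bool 0.5)[30d:1m]
) * 100
```

Note that, as far as I know, failed probes still report a probe_duration_seconds value, so a probe that fails quickly would count as "within X" here. If that matters for your SLA, you may want to combine this with probe_success.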