Check if a process is running on Bosun


I'm testing Bosun (open-source monitoring and alerting system by Stack Exchange) and I'm quite confused about how to monitor "boolean" metrics.

I would like to get alerted if some process is not running.

To collect the metric, I've tried two ways of doing it:

  • In the scollector documentation I see that some processes can be configured for monitoring, but I don't receive any related metric. Do I need any special configuration to enable those process checks?

  • I've created a custom collector to count those processes.

For getting alerted, I created the following rule:

alert test {
  template = test
  crit = avg(q("avg:myprocess.running{host=*}", "10m", "")) < 1
}

Is this the proper way of doing it or is there a better way?


Solution

  • Options

    1. If you have an alert and are using OpenTSDB, the alert goes unknown when a tagset "disappears" (no data for 2x the checkduration). You could then treat unknown as meaning "down".
    2. If the metric gets sent regardless of whether the process is up or down (i.e. there will always be a 0 or a 1), you can alert on that value. The only catch is that avg doesn't make much sense for a 0/1 metric (unless you are doing fuzzy logic), so you probably want to use last, max or min instead.
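
    As a rough sketch of option 2, assuming your custom collector always sends myprocess.running as 0 or 1 and that the template named test is already defined (the alert name here is just a placeholder), the rule could reduce the series with last instead of avg:

    ```
    # Hypothetical alert name; the metric and template come from the question.
    alert myprocess.down {
        template = test
        # last() takes the most recent value in the 10m window for each host,
        # so the result is the current 0/1 state rather than an average.
        $running = last(q("avg:myprocess.running{host=*}", "10m", ""))
        crit = $running < 1
    }
    ```

    Swapping last for min would trigger if any sample in the window was 0, while max would trigger only if the process was down for the whole window.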

    Conf

    The scollector conf goes on each host. The configuration lines should be as specified in the documentation you linked. Also keep in mind that your example alert has no warnNotification or critNotification, so it will only show up on the dashboard (no emails or HTTP posts will be sent).
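
    To illustrate the notification part, here is a minimal sketch. The notification name and email address are placeholders, and it assumes Bosun's SMTP settings are already configured so that email can actually be delivered:

    ```
    # Placeholder notification; define it before the alert that references it.
    notification default {
        email = ops@example.com
    }

    alert test {
        template = test
        crit = avg(q("avg:myprocess.running{host=*}", "10m", "")) < 1
        # Without this line the alert only appears on the dashboard.
        critNotification = default
    }
    ```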

    Tagsets and the OpenTSDB query

    It is important to understand the first argument in "avg:myprocess.running{host=*}". The avg there is the aggregator: it averages across all the tags that you did not specify. So, for instance, if you also had an ID tag like our scollector ones, you might want to use sum in the query string instead of avg, and alert if there is less than one process.
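
    A sketch of that sum variant, assuming your metric carries an extra per-process tag besides host (the alert name is again a placeholder):

    ```
    # Hypothetical alert name; assumes myprocess.running has a per-process tag
    # (such as an id) in addition to host.
    alert myprocess.count {
        template = test
        # "sum:" adds up the series across every tag not listed in {host=*},
        # which gives the number of running processes per host.
        $count = last(q("sum:myprocess.running{host=*}", "10m", ""))
        crit = $count < 1
    }
    ```

    With avg in the query string, one process at 1 and one at 0 on the same host would come back as 0.5, which is hard to interpret; with sum the same host reports 1.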