
Creating an HTTP 500 alert in Datadog that monitors multiple services in the same alert


Our server has a user base, and each user may select one of the 3rd-party services we provide to communicate with.

Each 3rd-party service has a different (and growing) user population communicating with it through our system:

  • Service (A) might have 30k users
  • Service (B) has 5k
  • Service (C) has 100k

We want to create an alert whenever any of these services is down (i.e., by monitoring 500s).

We send a metric from a central networking point in our code when a 500 occurs, which includes the URL of the service as a tag.
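To make that concrete, here is a minimal sketch of how such a metric could be emitted with DogStatsD in Python; the metric name `app.http.error_500`, the tag key, and the client setup are placeholders, not our exact code:

```python
# Minimal sketch of emitting the 500 metric (assumptions: Python, the
# datadogpy DogStatsD client, and a placeholder metric name / tag key).
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

def report_response(status_code: int, service_url: str) -> None:
    """Called from the central networking point for every 3rd-party response."""
    if status_code == 500:
        # The service URL tag is what lets a single monitor group per service.
        statsd.increment("app.http.error_500", tags=[f"service_url:{service_url}"])
```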

A couple of constraints:

  1. We prefer to create just one monitor that catches everything and reports each service individually (so if Service A and Service B are down, we get 2 alerts). We don't want to create multiple monitors for the same purpose to monitor different services (and maybe combine them into a composite monitor), because the number of services we communicate with may grow in the future.

  2. We don't want to explicitly set a threshold on the number of 500s in the single monitor we create, above which it sends an alert, because each service has a differently sized user population: 10 occurrences of 500 in 10 minutes for Service (C) (100k users) shouldn't be treated as the service being down, whereas for Service (B) (5k users) it might be.

I thought of using Outlier or Anomaly monitors, but we're trying to figure out the best configuration to avoid false positives: switching the Outlier algorithm between DBSCAN and MAD sometimes yields nothing, and changing the tolerance yields false positives.

With DBSCAN at tolerance 3.0, the big spike is not detected (screenshot).

Tolerances down to 1.0 detect nothing, but 0.5 detects everything, which likely includes false positives (screenshot).

The same behavior occurs with the MAD algorithm; there's no specific tolerance that catches the correct values.
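For reference, the outlier query we've been tuning is roughly of this shape (the metric name, tag key, window, and tolerance below are placeholders, not our real values):

```python
# Rough shape of the outlier monitor query being tuned (placeholders only;
# the algorithm argument is 'dbscan' or 'mad', the last argument is the tolerance).
outlier_query = (
    "avg(last_30m):"
    "outliers(sum:app.http.error_500{*} by {service_url}, 'dbscan', 3.0) > 0"
)
```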

Any recommendations regarding the configuration above are welcome, as are suggestions for a different kind of monitor entirely.


Solution

1. You could attach a service tag to the metrics and create a Multi Alert monitor to alert for each service that meets the threshold (see the monitor-creation sketch after this list).

   A Multi Alert monitor triggers individual notifications for each entity in a monitor that meets the alert threshold.
   For example, when setting up a monitor to notify you if the P99 latency, aggregated by service, exceeds a certain threshold, you would receive a separate alert for each individual service whose P99 latency exceeded the alert threshold.

   https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#alert-grouping

2. If you have the "number of users per service" count in Datadog as a metric, you could normalize the error count relative to the user count and set up thresholds based on that value (see the query sketch at the end of this answer).
   Example:
   Service A (100 users): 10 errors / 100 users = 0.1.
   Service B (2000 users): 10 errors / 2000 users = 0.005.
   So, if you set a threshold of >= 0.1, Service A would alert when there are 10 or more errors, and Service B would alert only when there are 200 or more errors.
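To illustrate the first suggestion, here is a minimal sketch that creates a single Multi Alert metric monitor grouped by the service tag, using datadogpy's API client. The metric name, tag key, threshold, and notification handle are placeholder assumptions, not values from the question:

```python
# Sketch only: one monitor, grouped by service tag, so each service that
# crosses the threshold notifies separately (Multi Alert behavior).
# Assumptions: datadogpy, a metric app.http.error_500 tagged with service_url,
# and a placeholder threshold / notification handle.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="metric alert",
    # "by {service_url}" is what makes this a Multi Alert monitor.
    query="sum(last_10m):sum:app.http.error_500{*} by {service_url}.as_count() > 50",
    name="HTTP 500s per 3rd-party service",
    message="{{service_url.name}} is returning 500s. @slack-alerts",
    options={"thresholds": {"critical": 50}, "notify_no_data": False},
)
```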
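For the second suggestion, the monitor query could divide the error count by a per-service user-count metric so that one threshold scales with population size. The `app.users.count` metric and the 0.1 value below are assumptions, and it's worth verifying that metric arithmetic grouped by the same tag behaves as expected on your data:

```python
# Sketch of a normalized monitor query (assumptions: a gauge named
# app.users.count reporting users per service, and an arbitrary 0.1 threshold,
# i.e. errors per user over the evaluation window).
normalized_query = (
    "sum(last_10m):"
    "sum:app.http.error_500{*} by {service_url}.as_count()"
    " / avg:app.users.count{*} by {service_url}"
    " > 0.1"
)
# As in the example above: a 100-user service alerts at 10+ errors,
# a 2000-user service only at 200+ errors, with the same 0.1 threshold.
```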