prometheus, grafana, codahale-metrics

Metric naming convention for orchestrator APIs


I have a web service with an endpoint, POST /coffee. It takes a body with some supported flags:

  1. sugar (adds sugar to the coffee)
  2. salt (a pinch of salt... )
  3. caramel (adds caramel)
  4. chocolate (adds chocolate)
  5. milk (adds milk)
  6. whipped cream (adds whipped cream)

There could be more as we expand. The point is that you can make any combination of these, including no flags at all (an espresso, perhaps).

Each of these flags corresponds to a separate operation (assume an API call to another service), so I want to measure each one.

What is the best way to measure this? My concern is the metric names. I can think of two approaches.

Option 1: Unique metric for every combination

Pro: I get to know specifically the latency and errors on a specific combination/pattern of calls.

Con: Not maintainable. The number of metric names explodes as new flags are introduced, and reading them all on a dashboard is hard.
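To put a number on that explosion, here is a minimal sketch that enumerates one metric name per combination of the six flags listed above. The `post_coffee_...` naming scheme and the extra `vanilla` flag are hypothetical, chosen only for illustration.

```python
from itertools import combinations

# Flags taken from the list in the question.
FLAGS = ["sugar", "salt", "caramel", "chocolate", "milk", "whipped_cream"]


def combination_metric_names(flags):
    """Return one hypothetical metric name per subset of flags (Option 1)."""
    names = []
    for size in range(len(flags) + 1):
        for combo in combinations(flags, size):
            # The empty combination is the plain espresso case.
            suffix = "_".join(combo) if combo else "espresso"
            names.append(f"post_coffee_{suffix}")
    return names


names = combination_metric_names(FLAGS)
print(len(names))  # 2 ** 6 = 64 metric names for six flags

# Adding a single (hypothetical) seventh flag doubles the count.
more_names = combination_metric_names(FLAGS + ["vanilla"])
print(len(more_names))  # 2 ** 7 = 128
```

Each new flag doubles the number of names, which is why this option gets out of hand quickly.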

Option 2: One metric only for "POST /coffee" that will include any combination.

Pro: Maintainable. Dashboards are sane.

Con: I will not be able to dissect the slowness of a specific combination. I can probably have separate metrics for the services being called, but I won't be able to associate those metrics with any specific combination of flags.

Maybe use labels on the metrics? Dropwizard Metrics does not support labels, and I don't know whether labels would be a good option anyway (label explosion in this case).


Solution

  • As I started explaining in the comments, I would use a coffee_ingredients_count metric with a single label, ingredient. I would then increment the counter for each ingredient (creating it if it does not exist yet). You are right that the number of label values will grow, but it is only one label, and at some point it will stop growing. The remaining issue is that it is a counter, so I would query rate(coffee_ingredients_count[5m]) to see what happened over the last 5 minutes.

    Would that mean you consider adding 4 ingredients to a coffee equivalent to 4 calls for an espresso (no ingredients)?

    Yes, because the counter is per ingredient.

    Then there is the problem you mentioned:

    Con: Will not be able to dissect the slowness on a specific combination

    I am not sure about the slowness, but you can still count the ingredients. You can invert what the rate function does to get the actual increase of each metric. In a perfectly ideal situation, the inverse calculation holds: rate(coffee_ingredients_count[5m]) * 300 = real_value, because rate returns a per-second average and the 5-minute window spans 300 seconds. However, this inverse calculation does not always hold when some samples do not cover the full range, or when samples do not line up perfectly because of random delays between scrapes (reference).
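The per-ingredient counter and the inverse of the rate calculation can be sketched in plain Python. This is only an illustration under ideal conditions: `collections.Counter` stands in for a labelled counter metric, and `record_order` and `per_second_rate` are hypothetical helpers, not part of any metrics client. Because rate() yields a per-second average, the sketch multiplies by the window length in seconds (300 for `[5m]`).

```python
from collections import Counter

# Hypothetical stand-in for a counter metric "coffee_ingredients_count"
# with a single label "ingredient"; this is not a real client library.
coffee_ingredients_count = Counter()

WINDOW_SECONDS = 5 * 60  # length of the [5m] range in seconds


def record_order(flags):
    """Increment the counter once per requested ingredient."""
    for ingredient in flags:
        coffee_ingredients_count[ingredient] += 1


def per_second_rate(old_value, new_value, window_seconds):
    """Average per-second increase over the window (ideal case of rate())."""
    return (new_value - old_value) / window_seconds


# Counter value for ingredient="sugar" at the start of the window.
before = coffee_ingredients_count["sugar"]

# Simulate the orders that arrive during one 5-minute window.
for _ in range(150):
    record_order(["sugar", "milk"])
record_order([])  # an espresso has no flags, so no label is incremented

# Counter value for ingredient="sugar" at the end of the window.
after = coffee_ingredients_count["sugar"]

rate = per_second_rate(before, after, WINDOW_SECONDS)
recovered = rate * WINDOW_SECONDS  # invert the rate to get the raw increase

print(rate)       # 0.5 increments per second
print(recovered)  # 150.0, the number of sugar increments in the window
```

With real scrapes the recovered value is only approximate, for the reasons given above: samples rarely line up exactly with the edges of the window.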