spring-boot prometheus grafana spring-boot-actuator

SpringBoot - observability on *_max *_count *_sum metrics


Small question regarding Spring Boot, some of the useful default metrics, and how to properly use them in Grafana please.

Currently with a Spring Boot 2.5.1+ (question applicable to 2.x.x.) with Actuator + Micrometer + Prometheus dependencies, there are lots of very handy default metrics that come out of the box.

I am seeing many of them following the pattern _max, _count, _sum.

Example, just to take a few:

spring_data_repository_invocations_seconds_max
spring_data_repository_invocations_seconds_count
spring_data_repository_invocations_seconds_sum

reactor_netty_http_client_data_received_bytes_max
reactor_netty_http_client_data_received_bytes_count
reactor_netty_http_client_data_received_bytes_sum

http_server_requests_seconds_max
http_server_requests_seconds_count
http_server_requests_seconds_sum

Unfortunately, I am not sure what to do with them or how to use them correctly, and I feel like my ignorance makes me miss out on some great application insights.

Searching on the web, I am seeing some using like this, to compute what seems to be an average with Grafana:

irate(http_server_requests_seconds_sum{exception="None", uri!~".*actuator.*"}[5m]) / irate(http_server_requests_seconds_count{exception="None", uri!~".*actuator.*"}[5m])

But I am not sure whether this is the correct way to use them.

May I ask what sort of queries are possible, or commonly used, when dealing with metrics of type _max, _count, _sum please?

Thank you


Solution

  • UPD 2022/11: Recently I've had a chance to work with these metrics myself, and I made a dashboard with everything I say in this answer and more. It's available on GitHub or Grafana.com. I hope this will be a good example of how you can use these metrics.

    Original answer:

    count and sum are generally used to calculate an average. count accumulates the number of times sum was increased, while sum holds the total value of something. Let's take http_server_requests_seconds for example:

    http_server_requests_seconds_sum   10
    http_server_requests_seconds_count 5
    

    With the example above one can say that there were 5 HTTP requests and their combined duration was 10 seconds. If you divide sum by count you'll get the average request duration of 2 seconds.

    Having these you can create at least two useful panels: average request duration (=average latency) and request rate.

    Request rate

    Using the rate() or irate() function you can get how many requests there were per second:

    rate(http_server_requests_seconds_count[5m])
    

    rate() works in the following way:

    1. Prometheus takes the samples within the given interval ([5m] in this example) and calculates the difference between the value at the evaluated point in time (not necessarily now) and the value 5 minutes before it.
    2. The obtained value is then divided by the number of seconds in the interval.
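
    The two steps above can be sketched in Python. This is a simplified model and not Prometheus's implementation: the sample layout (a time-ordered list of (timestamp, value) pairs) and the name simple_rate are assumptions made for illustration, and the real rate() additionally extrapolates to the window boundaries and compensates for counter resets.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample], now: float, window_s: float) -> float:
    """Per-second growth of a counter over the window that ends at `now`.

    Simplified on purpose: no extrapolation, no counter-reset handling.
    `samples` is assumed to be sorted by timestamp.
    """
    # Step 1: keep only the samples inside the window and take the
    # difference between the newest and the oldest of them.
    in_window = [(t, v) for t, v in samples if now - window_s <= t <= now]
    if len(in_window) < 2:
        raise ValueError("need at least two samples inside the window")
    difference = in_window[-1][1] - in_window[0][1]
    # Step 2: divide the difference by the window length in seconds.
    return difference / window_s


# Hypothetical scrapes of http_server_requests_seconds_count, one per minute.
scrapes = [(0, 100), (60, 130), (120, 160), (180, 190), (240, 220), (300, 250)]
requests_per_second = simple_rate(scrapes, now=300, window_s=300)
print(requests_per_second)  # 0.5
```

    Here 150 requests arrived over 300 seconds, so the sketch reports 0.5 requests per second.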

    A short interval will make the graph look like a saw (every fluctuation will be noticeable); a long interval will make the line smoother and slower to show changes.

    Average Request Duration

    You can proceed with

    http_server_requests_seconds_sum / http_server_requests_seconds_count
    

    but it is highly likely that you will only see a straight line on the graph. This is because both metrics are counters that accumulate since application start: their values grow so large over time that only a really drastic change would move the result of this query. For that reason it is better to calculate the average over a recent interval of samples. Using the increase() function you can get an approximate value of how much the metric changed during the interval. Thus:

    increase(http_server_requests_seconds_sum[5m]) / increase(http_server_requests_seconds_count[5m])
    

    The value is approximate because under the hood increase() is rate() multiplied by the interval. The error is insignificant for fast-moving counters (such as the request count); just be prepared to see non-integer results, such as an increase of 2.5 requests.
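
    To see where a fractional increase can come from, here is a small Python sketch. It is a simplified model of the extrapolation, not Prometheus's actual algorithm (which applies additional limits); the function name extrapolated_increase and all sample values are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def extrapolated_increase(samples: List[Sample], window_s: float) -> float:
    """Scale the growth seen between the first and last sample to the whole window."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    sampled_span_s = t_last - t_first
    # Multiply before dividing so the example stays exact in floating point.
    return (v_last - v_first) * window_s / sampled_span_s


# Hypothetical counters scraped every 60 s: a 5-minute window holds five
# samples, and those samples span only 240 s of the 300 s window.
count_samples = [(30, 100), (90, 100), (150, 101), (210, 101), (270, 102)]
sum_samples = [(30, 40.0), (90, 40.0), (150, 40.5), (210, 40.5), (270, 41.0)]

count_increase = extrapolated_increase(count_samples, window_s=300)
sum_increase = extrapolated_increase(sum_samples, window_s=300)

print(count_increase)                 # 2.5, although only 2 requests were counted
print(sum_increase / count_increase)  # 0.5 (seconds, average over the window)
print(41.0 / 102)                     # lifetime average, roughly 0.40 seconds
```

    Both increases are scaled by the same factor, so the factor cancels in the division and the windowed average is unaffected. The last line shows the lifetime sum / count for comparison, which moves far less when recent requests get slower.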

    Aggregation and filtering

    If you have already run one of the queries above, you will have noticed that there is not one line, but many. This is due to labels: each unique set of label values that the metric has is considered a separate time series. You can collapse them with an aggregation operator (like sum()). For example, you can aggregate the request rate by instance:

    sum by(instance) (rate(http_server_requests_seconds_count[5m]))
    

    This will show you a line for each unique value of the instance label. If you want to see only some instances rather than all of them, you can do that with a filter. For example, to calculate a value just for the nodeA instance:

    sum by(instance) (rate(http_server_requests_seconds_count{instance="nodeA"}[5m]))
    

    Read more about selectors here. With labels you can create any number of useful panels. Perhaps you'd like to calculate the percentage of exceptions, or their rate of occurrence, or perhaps a request rate by status code, you name it.
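
    If it helps to picture what sum by(...) and a label filter do to the individual series, here is a rough Python model. The label sets and rate values are invented for the example, the helper sum_by is hypothetical, and only exact-match filters are modelled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Series = Tuple[Dict[str, str], float]  # (label set, current per-second rate)

# Hypothetical output of rate(http_server_requests_seconds_count[5m]):
# one value per unique label set.
series: List[Series] = [
    ({"instance": "nodeA", "uri": "/orders", "status": "200"}, 4.0),
    ({"instance": "nodeA", "uri": "/orders", "status": "500"}, 0.5),
    ({"instance": "nodeB", "uri": "/orders", "status": "200"}, 3.0),
    ({"instance": "nodeB", "uri": "/users", "status": "200"}, 1.5),
]


def sum_by(label: str, rows: List[Series], **matchers: str) -> Dict[str, float]:
    """Keep rows whose labels equal every matcher, then sum them per `label` value."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in rows:
        if all(labels.get(name) == wanted for name, wanted in matchers.items()):
            totals[labels.get(label, "")] += value
    return dict(totals)


by_instance = sum_by("instance", series)                    # one line per instance
only_node_a = sum_by("instance", series, instance="nodeA")  # filtered to nodeA
by_status = sum_by("status", series)                        # request rate by status code
error_share = by_status["500"] / sum(by_status.values())    # fraction of 500 responses

print(by_instance)  # {'nodeA': 4.5, 'nodeB': 4.5}
print(only_node_a)  # {'nodeA': 4.5}
print(by_status)    # {'200': 8.5, '500': 0.5}
```

    Grouping by a different label (status here) is all it takes to turn the same underlying series into a different panel.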

    Note on max

    From what I found on the web, max shows the maximum value recorded during some interval defined in the settings (the default is 2 minutes, if the source is to be trusted). This is a somewhat uncommon metric, and whether it is useful is up to you. Since it is a gauge (unlike sum and count, it can go both up and down), you don't need extra functions (such as rate()) to see its dynamics. Thus

    http_server_requests_seconds_max
    

    ... will show you the maximum request duration. You can augment this with aggregation functions (such as max() or avg()) and label filters to make it more useful.
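
    To illustrate why such a gauge can go down again, here is a toy Python model of a "maximum over a recent window". It is only a sketch of the behaviour described above, not Micrometer's implementation; the class name WindowedMax and the 120-second window are assumptions.

```python
from collections import deque
from typing import Deque, Tuple


class WindowedMax:
    """Reports the largest value recorded during the last `window_s` seconds."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._samples: Deque[Tuple[float, float]] = deque()  # (timestamp, value)

    def record(self, now: float, value: float) -> None:
        self._samples.append((now, value))

    def value(self, now: float) -> float:
        # Drop observations that have aged out of the window.
        while self._samples and self._samples[0][0] <= now - self.window_s:
            self._samples.popleft()
        # With nothing left in the window, the gauge falls back to 0.
        return max((v for _, v in self._samples), default=0.0)


gauge = WindowedMax(window_s=120)
gauge.record(now=0, value=0.2)
gauge.record(now=30, value=1.5)   # one slow request
gauge.record(now=100, value=0.3)

at_110 = gauge.value(now=110)  # slow request still inside the window
at_160 = gauge.value(now=160)  # slow request has aged out
at_300 = gauge.value(now=300)  # nothing recorded recently
print(at_110, at_160, at_300)  # 1.5 0.3 0.0
```

    The practical consequence for a dashboard is that a single slow request shows up as a plateau that lasts about as long as the window, and then disappears.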