Tags: prometheus, grafana, micrometer, spring-micrometer

Micrometer/Prometheus: How to measure processing duration?


I want to measure how long it takes to process some data: my application reads that data from a given source at a fixed rate. Before each cycle I store Instant.now(). I read the data and attach that single timestamp to each entry. The data gets stored and converted, and just before I send it out via WebSockets, I want to measure the duration between now() and that initial timestamp.

I tried

long millis = Duration.between(dataEntry.getReceivedTimestamp(), Instant.now()).toMillis();
LOG.info(millis + "ms");
registry.timer("processingDuration").record(millis, TimeUnit.MILLISECONDS);

but visualizing this only gives me processingDuration_seconds_count, _max and _sum. count and sum increase over time (of course), and max is constant most of the time. So how do I see higher and lower plateaus of load? I tried irate(processingDuration_seconds_sum[10m]) to at least see jumps, but since irate() uses only two data points, I still cannot easily identify longer periods of high load. Plus: the value is around 0.6 in the graph, while the logged values are around 5-10 ms, so I'm losing the actual value here.

So I tried to use a Gauge instead – which is supposed to allow both increasing and decreasing values:

registry.gauge("processingDurationGauge", millis);

I'd have thought this would go up and down within the range of the logged milliseconds, but it's constantly 92.

How do I measure the processing duration of my data over time?


Solution

  • Using a timer and record() is the correct approach.

    long millis = Duration.between(dataEntry.getReceivedTimestamp(), Instant.now()).toMillis();
    LOG.info(millis + "ms");
    registry.timer("processingDuration").record(millis, TimeUnit.MILLISECONDS);
    

    Assuming you are scraping every 30 seconds, you can use the _sum and _count to get the average recorded duration (this is also why irate() on _sum alone hovered near 0.6: the rate of _sum is processing-seconds accumulated per wall-clock second, which scales with throughput; dividing by the rate of _count recovers the per-entry duration):

    increase(processingDuration_seconds_sum[1m])/increase(processingDuration_seconds_count[1m])
    

    And if you wanted to compare how the current duration is behaving compared to the average over the last day:

    (increase(processingDuration_seconds_sum[1m])/
    increase(processingDuration_seconds_count[1m])) >
    (increase(processingDuration_seconds_sum[24h])/
    increase(processingDuration_seconds_count[24h])) * 1.50
    

    That would only return values where the 1m average is more than 1.5x the daily average. (I haven't tested that query, but it should convey the idea.)
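
  • As for why the gauge stayed constant at 92: registry.gauge("processingDurationGauge", millis) boxes the current long value and registers a gauge over that immutable Number, so every scrape reports the value from registration time; later calls with the same name return the existing meter and ignore the new value. If you also want a gauge, hold the state in a mutable number and update it each cycle, e.g. AtomicLong lastMillis = registry.gauge("processingDurationGauge", new AtomicLong(0)); followed by lastMillis.set(millis). The sketch below simulates that difference without the Micrometer dependency (names and values are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

public class GaugeSketch {
    public static void main(String[] args) {
        // A gauge samples its state object on every scrape. Registering a
        // boxed long freezes the first value forever, because the box is
        // immutable and later registrations under the same name are ignored.
        long millis = 92;
        Long frozen = millis;                      // what gauge(name, millis) effectively keeps
        Supplier<Number> badGauge = () -> frozen;  // every "scrape" sees 92

        // A mutable holder lets each scrape observe the latest measurement.
        AtomicLong lastMillis = new AtomicLong(millis);
        Supplier<Number> goodGauge = lastMillis::get;

        lastMillis.set(7);                         // a new processing cycle finishes
        System.out.println(badGauge.get());        // 92
        System.out.println(goodGauge.get());       // 7
    }
}
```

    Note that a gauge only exposes the last value at scrape time, so short spikes between scrapes are lost; the timer remains the better fit for latency.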