
Prometheus and Pushgateway: how to keep the state of an incrementing value across distributed processes


I am dealing with many short-lived runs of the same job (many instances of the same process per hour), which finish before Prometheus has time to scrape them. This is a valid use case for the Pushgateway.

My use case is that I want an error metric (of type Gauge) that counts errors across these jobs.

As I understand it, pushing a new value for a metric overrides the previous one. Looking at the code of a Python client library, for example, Gauge.inc() increments a value held by the current process, which starts over for each job run, so it does not provide a total count.
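To make the overwrite problem concrete, here is a minimal sketch using a toy in-memory model, not the real Pushgateway or client library. The class `ToyPushgateway`, the function `run_job`, and the metric name `job_errors` are hypothetical names chosen for illustration; the only behaviour modelled is that a push replaces what was stored under the same grouping key.

```python
# Toy model (an assumption, not the real Pushgateway): metrics are stored
# per grouping key, and a push replaces whatever was stored under that key.
class ToyPushgateway:
    def __init__(self):
        self.groups = {}

    def push(self, grouping_key, metrics):
        # Replace semantics: the previous metrics for this group are discarded.
        self.groups[grouping_key] = dict(metrics)


def run_job(gateway, failed):
    # Each job run is a fresh process, so its in-memory value starts at 0.
    errors = 0
    if failed:
        errors += 1  # analogous to calling Gauge.inc() in a new process
    gateway.push(("job", "batch"), {"job_errors": errors})


gateway = ToyPushgateway()
for failed in [True, True, False, True]:
    run_job(gateway, failed)

stored = gateway.groups[("job", "batch")]["job_errors"]
print(stored)  # 1: only the last run's value survives, not the total of 3
```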

I see the following options to create a proper counter:

  • add a job_instance label and sum over it when creating dashboards/alerts. The issue I see is that the metrics are never cleared, so running many jobs/instances will blow up the cache.
  • to avoid blowing up the cache, send delete requests periodically. This feels like a major hack.
  • query the metric up front and increment it. Besides possible timing/concurrency and dependency issues, I did not find an endpoint exposing the values.
  • use any other different approach
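The first option can be sketched with the same kind of toy model, again an illustration rather than real Pushgateway code. The names `groups`, `push`, `job_errors`, and the `run-N` instance values are hypothetical. Each run pushes under its own instance label, and the sum is taken at query time.

```python
# Toy model: one stored group per (job, instance) pair, nothing ever expires.
groups = {}


def push(job, instance, errors):
    # A distinct instance label gives each run its own group, so no run
    # overwrites another.
    groups[(job, instance)] = errors


for i, failed in enumerate([True, True, False, True]):
    push("batch", f"run-{i}", 1 if failed else 0)

# Equivalent of summing the metric over the instance label at query time.
total = sum(groups.values())
print(total, len(groups))  # 3 4
```

The total is now correct, but the number of stored groups grows by one with every run, which is the unbounded growth that the second option tries to handle with periodic deletes.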

What would be the proper way to create a counter that accumulates across multiple runs of the same process?


Solution

  • use any other different approach

    Use prom-aggregation-gateway instead. It's tailor-made for this kind of use case. From the README:

    According to https://prometheus.io/docs/practices/pushing/:

    The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever...

    The latter point is especially relevant when multiple instances of a job differentiate their metrics in the Pushgateway via an instance label or similar.

    This restriction makes the Prometheus pushgateway inappropriate for the use case of accepting metrics from a client-side web app, so we created this one to aggregate counters from multiple senders.

    Prom-aggregation-gateway presents a similar API but does not attempt to be a drop-in replacement.
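As a rough illustration of what "aggregate counters from multiple senders" means, the sketch below models a gateway that adds incoming counter samples to a running total instead of replacing them. It is a toy model under stated assumptions, not the prom-aggregation-gateway implementation: `ToyAggregator`, `render_counter`, `parse_samples`, and the metric name `job_errors_total` are hypothetical, and the parser handles only the simple lines produced here.

```python
from collections import Counter


def render_counter(name, labels, value):
    # Render one counter sample in the Prometheus text exposition format.
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"# TYPE {name} counter\n{name}{{{label_text}}} {value}\n"


def parse_samples(body):
    # Minimal parser for the bodies produced above: skip comment lines and
    # split each sample line into its series identifier and its value.
    samples = {}
    for line in body.splitlines():
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


class ToyAggregator:
    def __init__(self):
        self.totals = Counter()

    def push(self, body):
        # Additive semantics: samples with the same name and labels are summed.
        for series, value in parse_samples(body).items():
            self.totals[series] += value


aggregator = ToyAggregator()
for failed in [True, True, False, True]:
    # Each short-lived run reports only its own count.
    body = render_counter("job_errors_total", {"job": "batch"}, 1 if failed else 0)
    aggregator.push(body)

total = aggregator.totals['job_errors_total{job="batch"}']
print(total)  # 3.0
```

With a real gateway, each job would send its body over HTTP instead of calling `push` directly. The exact endpoint and accepted formats depend on the gateway version, so check its README rather than relying on this sketch.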