prometheus, open-telemetry

How to differentiate metric values from different instances of the same service?


I have a couple of services deployed on Kubernetes. Some are NodeJS based, others are Java based. In the cluster there's an OTEL Collector deployed, which then provides data to Prometheus. Grafana is used for dashboarding. For Java I'm using -javaagent:/jars/opentelemetry-javaagent.jar, and for NodeJS a simple tracing file such as:

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
    // Service name is configured by OTEL_SERVICE_NAME
    traceExporter: new OTLPTraceExporter(),
    metricReader: new PeriodicExportingMetricReader({
        exporter: new OTLPMetricExporter(),
        exportIntervalMillis: 5000,
    }),
    instrumentations: [getNodeAutoInstrumentations()], // includes https://www.npmjs.com/package/@opentelemetry/instrumentation-http
});

sdk.start();
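
(Assuming this snippet lives in tracing.js and the services are CommonJS apps, it's preloaded with something like:)

node --require ./tracing.js app.js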

The rest of the OTEL config is defined in environment variables (traces configuration is omitted for readability):

OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_METRICS_EXPORTER=otlp
OTEL_SERVICE_NAME=[service name]
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector-listens-here:4317
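
For reference, a minimal sketch of how these variables are set on the Kubernetes Deployments (container and image names here are made-up placeholders):

containers:
  - name: service-a                           # assumed container name
    image: registry.example/service-a:latest  # assumed image
    env:
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: grpc
      - name: OTEL_METRICS_EXPORTER
        value: otlp
      - name: OTEL_SERVICE_NAME
        value: service-a
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector-listens-here:4317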

The apps are deployed on Kubernetes with 2 or more pods each, and I think that's why I'm getting strange results for the http_server_duration_milliseconds_count metric. See examples:

  1. Service A with 5 pods running: [example chart for the metric]

  2. Service B with 2 pods running: [example chart, metric oscillating between 0 and some high value]

  3. Service C with 3 pods running: [example chart]

Available labels for those metrics are:

http_flavor 
http_method 
http_route 
http_scheme
http_status_code 
job 
net_host_name 
net_host_port 
net_protocol_name 
net_protocol_version

Is my assumption correct that there's no way to differentiate the pods, and that those metrics are treated as coming from one source? I'm imagining something like: ServiceA#pod1 exports value 1, then ServiceA#pod2 (which got more requests) exports 12, and after that ServiceA#pod1 exports 3 (as it got 2 new requests), and so on.
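
To make that concrete (label values below are made up for illustration): if every pod produces the same label set, Prometheus sees a single series, and successive samples interleave counters from different pods, which looks like constant counter resets:

http_server_duration_milliseconds_count{job="service-a", http_route="/items", http_method="GET"}
# t0: 1   <- pod1
# t1: 12  <- pod2
# t2: 3   <- pod1 again; looks like the counter reset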

If so, what's the best way to solve this?

  • I could probably use net_host_ip, which I would expect to be set to the pod IP, but this attribute isn't set automatically by the Java and NodeJS instrumentations.
  • Or maybe I should add a label like k8s_pod_name to differentiate the pods? (One way this might be done in the collector is sketched after this list.)
  • Also, service.instance.id seems like the "native" solution to my problem, but it's in an experimental state.
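
On the k8s_pod_name idea, here is a minimal sketch of attaching the pod name in the OTEL Collector, assuming the opentelemetry-collector-contrib k8sattributes processor and a Prometheus exporter (the receiver/exporter names are assumptions based on my setup):

processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.pod.name

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    # promote resource attributes (like k8s.pod.name) to metric labels
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [k8sattributes]
      exporters: [prometheus]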

Any suggestions or clarifications will be much appreciated :)


Solution

  • This is the intended use case for service.instance.id. "Experimental" in the OpenTelemetry specification unfortunately doesn't indicate how close to stable something is.

    Per the docs:

    Signals start as experimental, which covers alpha, beta, and release candidate versions of the signal.

    service.instance.id is likely safe to rely on because of how important it is for use cases like the one you shared (identifying different k8s pods, for example). The definition of how best to generate this ID could change, however; it's intended to be an opaque value used to compare the behavior of instances.
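
    As a minimal sketch of wiring this up on Kubernetes (POD_NAME is an assumed helper variable; OTEL_RESOURCE_ATTRIBUTES is the standard spec variable read by both the Java agent and the Node SDK), expose the pod name via the Downward API and set it as the instance ID:

    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: OTEL_RESOURCE_ATTRIBUTES
        value: "service.instance.id=$(POD_NAME)"

    With that in place, each pod exports a distinct service.instance.id resource attribute, so its series no longer collapse with those of its siblings.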