apache-flink, flink-cep, datadog

How do I relate the metrics in Datadog to execution plan operators in Flink?


In my scenario, Flink is sending metrics to Datadog. The Datadog host map is shown below (I have no idea why it is showing me latency here).

enter image description here

Flink metrics are sent to localhost. The issue here is that when

The flink-conf.yaml configuration is as follows:

    # adding metrics

    metrics.reporters: stsd , dghttp
    metrics.reporter.stsd.class: org.apache.flink.metrics.statsd.StatsDReporter
    metrics.reporter.stsd.host: localhost
    metrics.reporter.stsd.port: 8125

    #  for datadog
    metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
    metrics.reporter.dghttp.apikey: xxx
    metrics.reporter.dghttp.tags:  host:localhost, job_id : jobA , tm_id : task1 , operator_name : operator1

    metrics.scope.operator: numRecordsIn
    metrics.scope.operator : numRecordsInPerSecond
    metrics.scope.operator : numRecordsOut
    metrics.scope.operator : numRecordsOutPerSecond
    metrics.scope.operator : latency

The issue is that Datadog is showing 163 metrics, which I don't understand; I will explain below.

enter image description here

I don't understand the metric format in Datadog; it shows me metrics like this:

enter image description here

As shown in the image above:

  1. Latency is expressed as a time
  2. Number of events per second is expressed in events/sec
  3. Count is some value

So my question is: which metric is this?

Also, the execution plan of my job is something like this

How do I relate the metrics in Datadog with execution plan operators in Flink?

enter image description here

I have read in the Flink API 1.3.2 that I can use tags. I have tried to use them in the flink-conf.yaml file, but I don't fully understand what purpose they serve here.

My ultimate goal is to find, for each operator, the latency and the number of records in and out per second.


Solution

  • There are a variety of issues here.

    1. You've misconfigured the scope formats (metrics.scope.operator).

    For one, the configuration doesn't make sense, since you specify "metrics.scope.operator" multiple times; only the last config entry is honored.

    Second, and more importantly, you have misunderstood what scope formats are used for.

    Scope formats configure which context information (like the ID of the task) is included in the reported metric's name.

    By setting it to a constant ("latency", the last entry) you've told Flink not to include any context information. As a result, the numRecordsIn metric of every operator is reported under the same name, "latency.numRecordsIn".

    I suggest simply removing your scope configuration.

    2. You've misconfigured the Datadog tags.

    I do not understand what you were trying to do with your tags configuration.

    The tags configuration option can only be used to provide global tags, i.e. tags that are attached to every single metric, like "Flink".

    By default, every metric that the Datadog reporter sends has a tag attached for every available scope variable.

    So, if you have an operator named A, then the numRecordsIn metric will be reported with the tag "operator_name:A".

    Again, I would suggest simply removing your configuration.
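Putting both points together, the configuration could look roughly like the sketch below. It keeps the reporter settings from the question as they are, drops all metrics.scope.operator entries so that Flink's default scope format applies, and reduces the tags option to global tags only. The tag values `flink` and `jobA` are placeholders chosen for illustration, and the whole tags line can be omitted if no global tags are needed.

```yaml
# Reporters, unchanged from the question
metrics.reporters: stsd, dghttp
metrics.reporter.stsd.class: org.apache.flink.metrics.statsd.StatsDReporter
metrics.reporter.stsd.host: localhost
metrics.reporter.stsd.port: 8125

metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.apikey: xxx
# Optional: global tags only, attached to every metric (placeholder values)
metrics.reporter.dghttp.tags: flink,jobA

# No metrics.scope.operator entries: the default scope format is used.
# For reference, the documented default for operators is:
# metrics.scope.operator: <host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
```

With this in place, metrics such as numRecordsIn should arrive in Datadog tagged with the scope variables (for example "operator_name:A"), so filtering or grouping by the operator_name tag is one way to map each metric back to an operator in the execution plan.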