monitoring apache-flink metrics flink-streaming latency

Latency Monitoring in Flink application

I'm looking for help regarding latency monitoring (flink 1.8.0).

Let's say I have a simple streaming data flow with the following operators: FlinkKafkaConsumer -> Map -> print.

In case I want to measure a latency of records processing in my dataflow, what would be the best opportunity? I want to get the duration of processing input received in the source until it received by the sink/finished sink operation.

I've added my code: env.getConfig().setLatencyTrackingInterval(100);

And then, the following latency metrics are available:

But I don't understand what exactly they are measuring? Also latency avg values are not seem to be related to latency as I see it.

I've tried also to use codahale metrics to get duration of some methods but it's not helping me to get a latency of record that processed in my whole pipeline.

Is the solution related to LatencyMarker? If yes, how can I reach it in my sink operation in order to retrieve it?

Thanks, Roey.

Solution

-- copying my answer from the mailing list for future reference

Hi Roey,

with Latency Tracking you will get a distribution of the time it took for LatencyMarkers to travel from each source operator to each downstream operator (per default one histogram per source operator in each non-source operator, see metrics.latency.granularity).

LatencyMarkers are injected periodicaly in the sources and are flowing through the topology. They can not overtake regular records. LatencyMarkers pass through function (user code) without any delay. This means the latencies measured by latency tracking will only reflect a part of the end-to-end latency, in particular in non-backpressure scenarios. In backpressure scenarios latency markers will queue up before the slowest operator (as they can not overtake records) and the latency will better reflect the real latency in the pipeline. In my opinion, latency markers are not the right tool to measure the "user-facing/end-to-end latency" in a Flink application. For me this is a debugging tool to find sources of latency or congested channels.

I suggest, that instead of using latency tracking you add a histogram metric in the sink operator yourself, which depicts the difference between the current processing time and the event time to get a distribution of the event time lag at the source. If you do the same in the source (and any other points of interests) you will get a good picture of how the even-time lag changes over time.

Hope this helps.

Cheers,

Konstantin