apache-flink

How to count total records read in source using Flink dataset API


We currently use the Flink DataSet API to read files from a FileSystem and apply some batch transformations. We also want to obtain the total number of records processed once the job is finished. The pipeline looks like dataset.map().filter().

The count() function seems to be a non-parallel operator, and it requires an extra computation over the whole dataset.

Is there any approach to count the processed records inside the map operator and emit them as a side output, like in streaming, so we can aggregate them to get the total count? Or is there a better way to do this?

Thank you very much!


Solution

  • You probably want to use accumulators (counters). They let each parallel task record small statistics that Flink automatically merges and exposes once the job finishes, without adding a separate pass over the dataset.
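
    Here is a minimal sketch of that approach using a LongCounter inside a RichMapFunction; the accumulator name "num-records", the file paths, and the transformation logic are placeholders you would adapt to your pipeline:

    ```java
    import org.apache.flink.api.common.JobExecutionResult;
    import org.apache.flink.api.common.accumulators.LongCounter;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;

    public class RecordCountJob {

        // Map function that counts every record it processes via an accumulator.
        public static class CountingMapper extends RichMapFunction<String, String> {
            private final LongCounter numRecords = new LongCounter();

            @Override
            public void open(Configuration parameters) {
                // Register the accumulator under a job-wide name.
                getRuntimeContext().addAccumulator("num-records", numRecords);
            }

            @Override
            public String map(String value) {
                numRecords.add(1L);
                return value; // placeholder: apply your real transformation here
            }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<String> input = env.readTextFile("file:///path/to/input");

            input.map(new CountingMapper())
                 .filter(s -> !s.isEmpty())       // placeholder filter
                 .writeAsText("file:///path/to/output");

            // execute() returns the job result, which carries the merged accumulators.
            JobExecutionResult result = env.execute("record-count-example");
            long total = result.getAccumulatorResult("num-records");
            System.out.println("Records processed: " + total);
        }
    }
    ```

    Because each parallel instance of the mapper adds to its own local counter and Flink merges them when the job completes, this gives you the total count as a by-product of the normal run rather than as an extra non-parallel count() step.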