apache-flink

How to count total records read in source using Flink dataset API


We currently use the Flink DataSet API to read files from a FileSystem and apply some batch transformations. We also want to obtain the total number of records processed once the job is finished. The pipeline looks like dataset.map().filter().

The count() function seems to be a non-parallel operator, and it requires an extra computation over the whole dataset.

Is there any approach to count the processed records inside the map operator and emit them as a side output, like in streaming, so we can aggregate them to get the total count? Or is there a better way to do this?

Thank you very much!


Solution

  • You probably want to use accumulators (counters). They let each parallel task record small statistics that Flink automatically merges and exposes once the job finishes, without adding a separate pass over the dataset.
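
    Here is a minimal sketch of that approach using a LongCounter inside a RichMapFunction; the accumulator name "num-records", the file paths, and the transformation logic are placeholders you would adapt to your pipeline:

    ```java
    import org.apache.flink.api.common.JobExecutionResult;
    import org.apache.flink.api.common.accumulators.LongCounter;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;

    public class RecordCountJob {

        // Map function that counts every record it processes via an accumulator.
        public static class CountingMapper extends RichMapFunction<String, String> {
            private final LongCounter numRecords = new LongCounter();

            @Override
            public void open(Configuration parameters) {
                // Register the accumulator under a job-wide name.
                getRuntimeContext().addAccumulator("num-records", numRecords);
            }

            @Override
            public String map(String value) {
                numRecords.add(1L);
                return value; // placeholder: apply your real transformation here
            }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<String> input = env.readTextFile("file:///path/to/input");

            input.map(new CountingMapper())
                 .filter(s -> !s.isEmpty())       // placeholder filter
                 .writeAsText("file:///path/to/output");

            // execute() returns the job result, which carries the merged accumulators.
            JobExecutionResult result = env.execute("record-count-example");
            long total = result.getAccumulatorResult("num-records");
            System.out.println("Records processed: " + total);
        }
    }
    ```

    Because each parallel instance of the mapper adds to its own local counter and Flink merges them when the job completes, this gives you the total count as a by-product of the normal run rather than as an extra non-parallel count() step.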