Flink checkpoint end-to-end duration too long

checkpoint screenshot

One machine takes much longer to checkpoint than the others, even though its state size is about the same. Is this due to data skew or something else? (The data is grouped by user.)

Solution

Something is overwhelmed. To figure out where the problem is, look for backpressure delaying the arrival of the checkpoint barriers at that subtask, or for resource contention delaying the completion of the snapshot for that subtask.

Asymmetry like this is often an indication of a hot key -- e.g., one user with a lot of events.
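
One rough way to check the hot-key theory is to take a sample of the keys (for example, user ids pulled from the source topic) and see how much of the traffic the busiest key accounts for. The snippet below is only a sketch of that idea in plain Python; `key_skew` and the sample data are made up for illustration and are not part of Flink.

```python
from collections import Counter


def key_skew(keys, top_n=3):
    """Return (top_keys, max_share) for a sample of record keys.

    top_keys is a list of (key, count) pairs for the busiest keys;
    max_share is the fraction of all sampled records with the busiest key.
    """
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(top_n)
    max_share = top[0][1] / total
    return top, max_share


# Hypothetical sample: one user produces most of the events.
sample = ["u1"] * 90 + ["u2"] * 5 + ["u3"] * 5
top, share = key_skew(sample)
print(top)
print(f"busiest key share: {share:.0%}")
```

If a single key accounts for a large share of the sample, the subtask that owns that key is doing far more work than the others, which would be consistent with the barrier arriving late there even when the state sizes look similar.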
