apache-flink

Flink checkpoint end-to-end (E2E) duration too long


[Image: checkpoint screenshot]

One machine takes much longer to checkpoint than the others, even though its state size is about the same. Is this due to data skew, or something else? (The data is grouped by user.)


Solution

  • Something is overwhelmed. To find out where the problem is, check for backpressure delaying the arrival of the checkpoint barriers at that subtask, or for resource contention delaying the completion of that subtask's snapshot.

    Asymmetry like this often indicates a hot key, e.g., one user with a lot of events.
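One rough way to test the hot-key hypothesis is to take a sample of the input and count events per user. The sketch below is plain Java with no Flink dependency. It assumes the sampled keys are available as a list of user ID strings, and the class and method names (`HotKeyCheck`, `hottestKey`, `share`) are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HotKeyCheck {

    /** Counts how many events each key (here: a user ID) contributes to the sample. */
    static Map<String, Long> countPerKey(List<String> userIds) {
        Map<String, Long> counts = new HashMap<>();
        for (String id : userIds) {
            counts.merge(id, 1L, Long::sum);
        }
        return counts;
    }

    /** Returns the key with the most events, or null if the sample is empty. */
    static Map.Entry<String, Long> hottestKey(List<String> userIds) {
        Map.Entry<String, Long> best = null;
        for (Map.Entry<String, Long> e : countPerKey(userIds).entrySet()) {
            if (best == null || e.getValue() > best.getValue()) {
                best = e;
            }
        }
        return best;
    }

    /** Fraction of all sampled events that belong to one key. */
    static double share(long count, int total) {
        return total == 0 ? 0.0 : (double) count / total;
    }

    public static void main(String[] args) {
        // Hypothetical sample: in practice this would come from the job's input.
        List<String> sample = List.of("u1", "u2", "u1", "u1", "u3", "u1");
        Map.Entry<String, Long> top = hottestKey(sample);
        System.out.printf("hottest key=%s events=%d share=%.2f%n",
                top.getKey(), top.getValue(), share(top.getValue(), sample.size()));
    }
}
```

If one user accounts for a disproportionate share of the sample, the subtask that owns that key is probably the slow one. A roughly even distribution points back toward backpressure or resource contention on that machine.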