apache-flink

Can a checkpoint be completed while a task is down in Flink?


I am using Flink. I have five task managers, each with one task slot.

The source is Kafka, and after reading the data and applying some processing logic, the data is written to a DynamoDB sink.

Note that I chain all the operators, so the source, the processing, and the sink are combined into a single chained task.

I also use checkpointing with exactly-once semantics and a 5-minute interval.
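Roughly, the setup looks like the sketch below (the broker address, topic, group id, and processing step are placeholders, and the DynamoDB sink is replaced by a print sink to keep the example self-contained):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToDynamoJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Exactly-once checkpoints every 5 minutes, as described above.
        env.enableCheckpointing(5 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

        // Placeholder addresses/names; the real ones are not part of the question.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("input-topic")
                .setGroupId("my-consumer-group")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .map(value -> value.toUpperCase())  // stand-in for the real processing logic
                .print();                           // the real job uses a DynamoDB sink here

        env.execute("kafka-to-dynamodb");
    }
}
```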

In this scenario, the Kafka cluster went through a rolling update and the group coordinator changed.

During the update, Flink logged a warning saying that committing offsets back to the Kafka source failed.

Given this scenario, I have the following questions.

  1. Is it possible for a checkpoint to complete successfully? When I looked at the Flink dashboard, there were no failed checkpoints.

  2. If so, could it be that a task is down but the task manager is not, so the checkpoint can still complete? (That is, is a task failure a trivial issue that does not affect whether checkpoints are made?)

  3. With operator chaining, is data loss possible even though we use checkpoints? I mean that since the operators are combined from source to sink, when the source is affected by the external Kafka cluster, the sink is affected as well because of the chaining. At that point, does the task go down and the in-flight data get discarded?


Solution

    1. Is it possible for a checkpoint to complete successfully? When I looked at the Flink dashboard, there were no failed checkpoints.

    Yes. The message you saw occurs when the Kafka consumer used by Flink tries to commit offsets back to the Kafka cluster, but that mechanism is not used by checkpointing. Flink saves the offsets in its own snapshots and uses them when restoring from a checkpoint or savepoint, so it does not need the offsets committed to the Kafka cluster.
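    As an illustration (not part of the original answer), the commit back to Kafka only feeds external tools such as consumer-lag monitoring, and it can even be switched off; this sketch assumes the flink-connector-kafka KafkaSource and its commit.offsets.on.checkpoint property, with placeholder broker/topic names:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

// Flink's fault tolerance relies on the offsets stored in its own checkpoints;
// committing offsets back to Kafka is only a convenience for external monitoring.
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("kafka:9092")                  // placeholder address
        .setTopics("input-topic")                           // placeholder topic
        .setGroupId("my-consumer-group")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        // Even with the commit disabled, restore still works, because the offsets
        // live in the checkpoint/savepoint state, not in the Kafka cluster.
        .setProperty("commit.offsets.on.checkpoint", "false")
        .build();
```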

    2. If so, could it be that a task is down but the task manager is not, so the checkpoint can still complete [snip]

    I assume you're asking whether the job can be in a failed state while the Task Manager is still running. Yes, but that has nothing to do with your first question (it's not why a checkpoint could be completed even though you see errors).

    3. With operator chaining, is data loss possible even though we use checkpoints? I mean that since the operators are combined from source to sink, when the source is affected by the external Kafka cluster, the sink is affected as well because of the chaining. At that point, does the task go down and the in-flight data get discarded?

    You always "lose" in-flight data when a job fails. But because Flink saves the offsets for all Kafka partitions as part of the last successful checkpoint, when the job is restarted from that checkpoint, you will wind up replaying all of the in-flight data. So there is no data loss.
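    To make the replay concrete, here is a rough sketch (the interval and retry values are illustrative, and newer Flink versions may prefer setting the restart strategy via configuration): when the job fails, Flink restarts it, restores the last successful checkpoint, and the Kafka source rewinds to the checkpointed offsets and re-reads the in-flight data.

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpointing must be enabled for anything to be restored on restart.
        env.enableCheckpointing(5 * 60 * 1000L);

        // On failure, restart up to 3 times with a 10 s pause between attempts.
        // Each restart restores the last successful checkpoint, so the Kafka
        // offsets are rewound automatically and the in-flight data is re-read.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 10_000L));
    }
}
```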