Tags: apache-flink, flink-streaming, fault-tolerance, exactly-once

Exactly-once: who is storing the historical data, flink or the data source


I know that Apache Flink can provide exactly-once guarantees, which rely on its checkpoint mechanism and on a replayable data source.

As I understand it, if a Flink operator hits an error, it needs to run its last operations again, so it must be able to get the historical data back. In this case, where should/could that historical data be stored?

Say the data source is Apache Kafka: can I let Kafka store the historical data? Can I let Flink store it? Or can I let both of them do so? If both of them can do it together, does that mean I can have Kafka store one part of the historical data and Flink store the other part, so that I can keep more historical data in total?


Solution

  • Flink follows the dataflow approach to stream processing: each operator processes elements and sends them downstream as soon as they have been processed.

    At the sources, Flink periodically injects special markers, called checkpoint barriers, into the stream. When a barrier reaches an operator, the operator snapshots its state and forwards the barrier downstream. (A minimal configuration sketch for enabling this is included at the end of this answer.)

    When an operator fails, Flink restarts the affected part of the job from the last successful checkpoint: each operator's state is restored from that checkpoint, and the source is rewound so that every record processed after it is read and processed again. You therefore don't need to store the in-flight records between operators yourself; replaying from the source regenerates them.

    If you are using Kafka as the source, Flink takes care of exactly-once semantics on that side as well: the consumer offsets are stored as part of the checkpoint, and because Kafka retains records durably, they can simply be re-read from those offsets after a failure. (See the source sketch at the end of this answer.)

    For end-to-end exactly-once semantics you just need to guarantee that your sink is either idempotent or supports two-phase commit (see the sink sketch at the end of this answer).
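
    The sketch below shows how checkpointing with exactly-once mode is typically enabled in the Java DataStream API. The 10-second interval and the checkpoint storage path are placeholder values for illustration, and exact class locations can vary slightly between Flink versions.

    ```java
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Inject checkpoint barriers every 10 seconds and snapshot operator state
    // with exactly-once guarantees.
    env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

    // Keep completed checkpoints on durable storage so that state survives
    // process failures and the job can be restored from the last checkpoint.
    env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");
    ```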
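
    Continuing the same sketch, a Kafka source can be wired in with the KafkaSource builder from the flink-connector-kafka module. On recovery, the offsets stored in the last checkpoint take precedence; the OffsetsInitializer below only decides where a brand-new job (with no checkpoint) starts reading. The broker address, topic, and group id are placeholders.

    ```java
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;

    KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("broker:9092")
            .setTopics("events")
            .setGroupId("my-flink-job")
            // Where a fresh job starts reading; after a failure, the offsets
            // recorded in the checkpoint are used instead.
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

    DataStream<String> stream =
            env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");
    ```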
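
    Finally, a transactional Kafka sink gives end-to-end exactly-once delivery via two-phase commit: Kafka transactions are committed only when the corresponding Flink checkpoint completes. Again, the broker, topic, and transactional-id prefix are placeholders, and the Kafka producer's transaction.timeout.ms needs to be large enough to cover the checkpoint interval.

    ```java
    import org.apache.flink.connector.base.DeliveryGuarantee;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;

    KafkaSink<String> sink = KafkaSink.<String>builder()
            .setBootstrapServers("broker:9092")
            .setRecordSerializer(
                    KafkaRecordSerializationSchema.builder()
                            .setTopic("results")
                            .setValueSerializationSchema(new SimpleStringSchema())
                            .build())
            // Two-phase commit: records become visible to downstream consumers
            // only after the checkpoint that covers them has completed.
            .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
            .setTransactionalIdPrefix("my-flink-job")
            .build();

    stream.sinkTo(sink);
    env.execute("exactly-once-pipeline");
    ```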