Search code examples
bigdataapache-flinkflink-streaming

Apache Flink: How to store intermedia data in streaming application


I am implementing the MisraGries algorithm with Flink's DataStream API. It keeps k counters to record the data summary by increment or decrement.

What is the best approach to store such counters when using DataStream API to implement the algorithm? Now I just declared a HashMap variable in the operator. Is this the right approach or do I need to use some other features like state?


Solution

  • You should store the counters in Flink's managed state, i.e., either keyed state or operator state and enable checkpointing. Otherwise, the information will be lost in case of a failure.

    If state is correctly used and checkpointing is enabled, Flink periodically checkpoints the state of an application. In case of a failure, the job is restarted and its state is reset to the latest checkpoint.