
Joining two streams with different latencies in stream processing


We have two transaction streams that we need to configure for future use cases, and I'm curious about your thoughts on this (I'm new to streaming data). Our environment includes Flink and KStreams. The two streams have different latencies.

  1. If we do not have an upper bound on the latency, how can we ensure the completeness of the data in the output stream?
  2. If we know there is a maximum latency of 60 seconds, and there is a constraint that we cannot hold the objects in memory, how can we ensure the completeness of the data in the output stream?

Solution

  • In Flink, your WatermarkStrategy is responsible for managing the tradeoff between completeness and latency. With a longer watermark delay you can be more confident of operating on complete data, at the cost of additional latency.

    ... and there is a constraint that we cannot hold the objects in memory

    Ensuring completeness of the results in this situation depends on what you are doing. If, for example, you are computing windowed analytics, then you can use incremental aggregation of the window results to limit the state you are keeping to a single value per window. So long as the watermarking is correct (meaning you avoid having any late data), your results will be complete.

    (And for what it's worth, Flink can also spill state to disk if you use the RocksDB state backend rather than the heap-based one.)
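To make the completeness-vs-latency tradeoff concrete, here is a small self-contained sketch (plain Python, not the Flink API) of the bounded-out-of-orderness idea behind a `WatermarkStrategy`: the watermark trails the largest timestamp seen so far by a fixed delay, and anything arriving behind the watermark counts as late. The function name, the delay value, and the sample timestamps are all made up for illustration.

```python
# Illustrative sketch (NOT the Flink API): a bounded-out-of-orderness
# watermark. `delay_ms` plays the role of the bound you would pass to
# Flink's WatermarkStrategy.forBoundedOutOfOrderness.

def find_late_events(arrival_order, delay_ms):
    """arrival_order: event timestamps (ms) in the order they arrive.
    Returns the timestamps that are late, i.e. those arriving after the
    watermark (max timestamp seen so far minus the delay) has passed them."""
    max_ts = float("-inf")
    late = []
    for ts in arrival_order:
        watermark = max_ts - delay_ms
        if ts <= watermark:
            late.append(ts)       # would be dropped or side-outputted
        max_ts = max(max_ts, ts)
    return late

arrivals = [1000, 2000, 1500, 5000, 2500, 6000]

# A short delay treats the out-of-order event at 2500 as late ...
print(find_late_events(arrivals, 1000))   # -> [2500]
# ... a longer delay keeps it, at the cost of waiting longer to emit results.
print(find_late_events(arrivals, 3000))   # -> []
```

A longer delay means fewer late (hence dropped) events and therefore more complete output, but every window waits that much longer before it can fire.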
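And to illustrate the incremental-aggregation point, the sketch below (again plain Python, with made-up names and numbers, standing in for what Flink's ReduceFunction/AggregateFunction does inside a window) keeps only one running number per open window instead of buffering every event, so the state stays bounded even under the no-objects-in-memory constraint:

```python
# Illustrative sketch of incremental window aggregation: one running
# sum per open tumbling window, never a buffer of raw events.

def windowed_sums(events, window_ms, delay_ms):
    """events: (timestamp_ms, value) pairs in arrival order.
    Returns {window_start: sum}, firing each window once the watermark
    (max timestamp seen so far minus delay_ms) passes the window end."""
    state = {}                    # one number per open window
    results = {}
    max_ts = float("-inf")
    for ts, value in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - delay_ms
        start = (ts // window_ms) * window_ms
        if start + window_ms <= watermark:
            continue              # late event: its window already fired
        state[start] = state.get(start, 0) + value   # incremental update
        for s in [s for s in state if s + window_ms <= watermark]:
            results[s] = state.pop(s)                # fire closed windows
    results.update(state)         # end of stream: fire what remains
    return results

events = [(1, 1), (3, 2), (12, 4), (8, 8), (17, 1)]
print(windowed_sums(events, window_ms=10, delay_ms=5))   # -> {0: 11, 10: 5}
print(windowed_sums(events, window_ms=10, delay_ms=0))   # -> {0: 3, 10: 5}
```

With a 5 ms watermark delay the out-of-order event at timestamp 8 still lands in its window and the output is complete; with no delay that event arrives behind the watermark and its contribution is lost, which is exactly the tradeoff described above.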