Search code examples
javaapache-flinkstreamingpyflink

Flink Batch vs Stream how they process real time data


I have read the documentation of streaming mode and batch mode. I assume that if I have an unbounded stream and I apply windows (like tumbling) on it it becomes a bounded stream? Please correct and stop me here if I am wrong. But that's what I understood from the documentation and this picture, which apparently represents a bounded stream can exist inside an unbounded stream, practically what windows (i.e Tumbling) does:

Bounded vs Unbounded

... In case that's right, If I am applying tumbling window on an unbounded input stream (thus converting it to bounded) and set it to be on batch mode, does the first window filling up have to reach the sink so the new incoming window can be read? For example, if window #1 is formed and it takes 5 seconds for each window to reach the sink, but before those 5 seconds at second 2 a window#2 would have already been ready at the source, does flink in batch mode will NOT read window#2 until window#1 has reached the sink?

In other words, which of these is correct?

  • Window#1 arrives at t=1, and until it reaches the sink at t=5 then window#2 is read at t=5, which means window#2 leaves the sink at t=10
  • Window#1 arrives at t=1, and as it progresses through the operators downstreams, window#2 arrives at t=2 and it also flows downstream, meaning the entire pipeline has windows 1 and 2 on different stages, but both present in the pipeline at the same time. Eventually window#1 leaves the sink at t=5 and window#2 around t=7

Option 1 Implies there can only be one window at a time anywhere in the pipeline. Option 2 implies there can be multiple windows in the pipeline (regardless of the operator they are at) at the same time


Solution

  • Allow me to try to clarify a few points:

    (1) A bounded stream can either be processed in batch mode or in streaming mode. Batch mode will be more efficient, because various optimizations can be applied if the Flink runtime knows that there's a finite amount of data to process. On the other hand, unbounded inputs can only be processed in streaming mode.

    (2) Applying windows does not convert an unbounded streaming job into a batch job. The fact that windowing is happening doesn't have any effect on how the job is executed -- it could be an unbounded streaming job, a bounded streaming job, or a bounded batch job, depending on whether the inputs are bounded or unbounded, and the choice of execution mode.

    (3) The various stages of the pipeline run independently of one another. There may be lots of windows open concurrently, especially if event time semantics are being used, and the event stream is significantly out-of-order.

    (4) Currently the entire job must be running in either batch mode or streaming job. It's not (yet) possible to mix and match the two execution modes.

    (5) Only sources built using the FLIP-27 interface support both batch and streaming, e.g., the KafkaSource (and not the legacy FlinkKafkaConsumer). As the docs explain:

    KafkaSource is designed to support both streaming and batch running mode. By default, the KafkaSource is set to run in streaming manner, thus never stops until Flink job fails or is cancelled. You can use setBounded(OffsetsInitializer) to specify stopping offsets and set the source running in batch mode. When all partitions have reached their stopping offsets, the source will exit.

    You can also set KafkaSource running in streaming mode, but still stop at the stopping offset by using setUnbounded(OffsetsInitializer). The source will exit when all partitions reach their specified stopping offset.