spark-streaming apache-flink flink-streaming

Network shuffle in streaming

So,keyBy or groupBy causes a network shuffle that repartitions the stream. It is said that it is pretty expensive, since it involves network communication along with serialization and deserialization etc.

For an example, if I run the following operators:

map(Mapper1).keyBy(0).map(Mapper2)

with a parallelism of 2, I would get something like this:

Mapper1(1) -\-/- Mapper2(1)
             X
Mapper1(2) -/-\- Mapper2(2)

And in the end all records with the same key within the Mapper1 are assigned to the same partition in Mapper2.

My question is:

I want to know what happens during the keyBy or groupBy in streaming. Every processed element is serialized and deserialized by every sub task ? How can I compare the cost of keyBy or groupBy with an another operation ?

Also, I am familiar with the concept of partitioner in batch systems, but I am getting a bit confused when I am trying to apply that in streaming.

Thank you !

Solution

So Apache Flink buffers the outgoing of a task and after that sends it to the next task for processing. setBufferTimeout is a parameter on the job-level which can be configured via the StreamExecutionEnvironment and the default value for this timeout is 100 ms. After this time, the buffers are sent automatically even if they are not full.

Also the following links are really helpful to understand the details:

https://flink.apache.org/2019/06/05/flink-network-stack.html

https://flink.apache.org/2019/07/23/flink-network-stack-2.html