So,keyBy
or groupBy
causes a network shuffle that repartitions the stream. It is said that it is pretty expensive, since it involves network communication along with serialization and deserialization etc.
For an example, if I run the following operators:
map(Mapper1).keyBy(0).map(Mapper2)
with a parallelism of 2, I would get something like this:
Mapper1(1) -\-/- Mapper2(1)
X
Mapper1(2) -/-\- Mapper2(2)
And in the end all records with the same key within the Mapper1
are assigned to the same partition in Mapper2
.
My question is:
I want to know what happens during the keyBy
or groupBy
in streaming. Every processed element is serialized and deserialized by every sub task ? How can I compare the cost of keyBy
or groupBy
with an another operation ?
Also, I am familiar with the concept of partitioner in batch systems, but I am getting a bit confused when I am trying to apply that in streaming.
Thank you !
So Apache Flink buffers the outgoing of a task and after that sends it to the next task for processing. setBufferTimeout
is a parameter on the job-level which can be configured via the StreamExecutionEnvironment
and the default value for this timeout is 100 ms. After this time, the buffers are sent automatically even if they are not full.
Also the following links are really helpful to understand the details:
https://flink.apache.org/2019/06/05/flink-network-stack.html
https://flink.apache.org/2019/07/23/flink-network-stack-2.html