I'm still struggling to understand how Flink "exchanges/transfers" data between different operators and what happens to the actual data between the operators.
Take the example DAG above ("DAG of execution" image):
The DataSet gets forwarded/transferred to all parallel instances of the GroupReduce operator, and the data gets reduced according to the GroupReduce transformation.
All of the new data gets forwarded to the Filter->Map->Map operator, i.e. all data produced by one parallel instance of the GroupReduce operator is transferred to exactly one instance of the Filter->Map->Map operator (without any need for serialization/deserialization, so that operator directly accesses the data generated by the GroupReduce operator).
All of the GroupReduce's output data gets hashed and evenly distributed/transferred among all the parallel instances of the (Filter->Map) operator (serialization/deserialization is needed between the operators).
So if, for example, the GroupReduce operator's output is about 100 MB, it forwards 100 MB to the (Filter->Map->Map) operator and additionally hashes a copy of that 100 MB and transfers it to the (Filter->Map) instances, generating another 100 MB of network traffic.
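To make that concrete, here is a minimal DataSet job sketch that would produce both connection types (one forwarded consumer, one hash-partitioned consumer of the same GroupReduce output). All names, values, and output paths are my assumptions, not the program behind the screenshot:

```java
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class TwoConsumersJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> input = env.fromElements(
                Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

        // The GroupReduce operator from the DAG: group by field 0, sum field 1.
        DataSet<Tuple2<String, Integer>> reduced = input
                .groupBy(0)
                .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                    @Override
                    public void reduce(Iterable<Tuple2<String, Integer>> values,
                                       Collector<Tuple2<String, Integer>> out) {
                        String key = null;
                        int sum = 0;
                        for (Tuple2<String, Integer> v : values) {
                            key = v.f0;
                            sum += v.f1;
                        }
                        out.collect(Tuple2.of(key, sum));
                    }
                });

        // A reusable map step so both branches mirror the DAG's Map operators.
        MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> doubler =
                new MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(Tuple2<String, Integer> t) {
                        return Tuple2.of(t.f0, t.f1 * 2);
                    }
                };

        // Consumer 1 (Filter->Map->Map): no re-partitioning required, so the
        // records can be handed over locally (the forward connection above).
        reduced.filter(t -> t.f1 > 0)
               .map(doubler)
               .map(doubler)
               .writeAsText("/tmp/forward-branch");   // hypothetical output path

        // Consumer 2 (Filter->Map): an explicit re-partitioning forces a hash
        // shuffle (serialization + network transfer) of the same reduced data.
        reduced.partitionByHash(1)
               .filter(t -> t.f1 > 1)
               .map(doubler)
               .writeAsText("/tmp/hash-branch");      // hypothetical output path

        env.execute("two consumers of one GroupReduce");
    }
}
```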
I'm quite confused why there is so much network traffic after the GroupReduce and before the Filter steps. Wouldn't it be better to chain the GroupReduce and Filter steps together before sending the now-reduced data to subsequent operators?
Using a GroupReduce function here is the same as using a combiner from the MapReduce programming model. As the Flink documentation puts it:
Partial computation can significantly improve the performance of a GroupReduceFunction. This technique is also known as applying a Combiner. Implement the GroupCombineFunction interface to enable partial computations, i.e., a combiner for this GroupReduceFunction.
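Concretely, a combinable GroupReduce is written by implementing both interfaces in one class. The Tuple2 types and the summing logic below are my assumptions, a minimal sketch rather than the job from the question:

```java
import org.apache.flink.api.common.functions.GroupCombineFunction;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Sums the Integer field per key. Because the class also implements
// GroupCombineFunction, Flink can run combine() on each sender's local
// data before the shuffle, so only pre-aggregated records travel over
// the network to the final reduce().
public class CombinableSum
        implements GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>,
                   GroupCombineFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> {

    @Override
    public void reduce(Iterable<Tuple2<String, Integer>> values,
                       Collector<Tuple2<String, Integer>> out) {
        sum(values, out);  // final aggregation, after the shuffle
    }

    @Override
    public void combine(Iterable<Tuple2<String, Integer>> values,
                        Collector<Tuple2<String, Integer>> out) {
        sum(values, out);  // partial aggregation, before the shuffle
    }

    private static void sum(Iterable<Tuple2<String, Integer>> values,
                            Collector<Tuple2<String, Integer>> out) {
        String key = null;
        int sum = 0;
        for (Tuple2<String, Integer> v : values) {
            key = v.f0;
            sum += v.f1;
        }
        out.collect(Tuple2.of(key, sum));
    }
}
```

Attached via input.groupBy(0).reduceGroup(new CombinableSum()), the combine() step can run chained to the producing operators, so only the pre-aggregated output is hash-partitioned across the network before reduce() runs.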
So, after a combiner there is always a shuffle/partition phase that connects all upstream operator instances to all downstream operator instances. Check this answer for a clarification of what a combiner is.
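If you want to verify which ship strategies the optimizer actually chose for your job, you can print the JSON execution plan instead of executing the program; if I remember the plan format correctly, each connection lists a ship_strategy entry (e.g. Forward vs Hash Partition):

```java
// Build the DataSet program (sources, transformations, sinks) as usual,
// then ask for the optimizer's plan instead of calling env.execute().
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// ... define the job here ...
System.out.println(env.getExecutionPlan()); // JSON plan incl. ship strategies
```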