Search code examples
apache-flinkshuffle

Does Flink keyby on the same field which isn't changed cause a shuffle?


dataStream.map(func1).keyBy("key") //(1)
  .process(func2).keyBy("key")     //(2)
  .timeWindow().aggregate(func3).addSink(sink)

Method process() doesn't change the field(key) value of records. Given that the parallelism of all operators is 2, does keyBy() at (2) also result in network shuffle? Maybe keyBy() at (2) has the effect of forward strategy avoiding network communication cost due to the unchanged key value?

Thx soooo much~


Solution

  • A keyBy is always expensive, because it forces the records to go through ser/de. But in the case where the communication is local -- i.e., within the same task slot -- then Flink will use a shared buffer to communicate the serialized bytes, rather than going through the whole netty tcp stack. So yes, in your case the second keyBy is less expensive than the first one. But I would not say the cost is small.

    If you know that the keyBy is completely unnecessary, you can use reinterpretAsKeyedStream to get back to having a KeyedStream again without any of this overhead.