apache-flink

What is the shuffle strategy of Apache Flink? Is it like the shuffle in Hadoop?


In Hadoop, there is a shuffle phase between map and reduce. I want to know whether there is such a stage in Flink and how it works, because the websites I have read do not say much about it. For example, a WordCount demo has a flatMap, a keyBy, and a sum. Is there always a shuffle phase between two operators? And can I get the intermediate data between these operators?


Solution

  • Shuffle is not always performed; it depends on which operators you use. In your example, the keyBy step in the WordCount job introduces a hash partitioner, which shuffles the data across the network based on the key (see the sketch after this list).

    In other cases, for example if you only process and filter your data without any aggregation and then write it out somewhere, each parallel partition keeps its own data and no shuffling is involved.

    So, to answer your questions:

    1. No, shuffling is not always involved between two operators; it depends on the operators. Repartitioning operations such as keyBy (hash partitioning), rebalance, and broadcast introduce a network exchange, whereas operators such as map, flatMap, and filter do not.
    2. If you are asking about intermediate files that you can access, as in Hadoop, then the answer is no. Flink is an in-memory processing engine and, in most cases, pipelines data between operators in memory rather than materializing intermediate results to disk.
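
    For illustration, here is a minimal sketch of such a WordCount job using the Flink DataStream API in Java; the class name, job name, and sample input are placeholders, not from the original question. The only network shuffle in this pipeline is the one introduced by keyBy.

    ```java
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements("to be or not to be")
               // flatMap is forwarded/chained locally: no network shuffle here
               .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                   for (String word : line.toLowerCase().split("\\W+")) {
                       out.collect(Tuple2.of(word, 1));
                   }
               })
               // needed because the lambda's output type is erased at runtime
               .returns(Types.TUPLE(Types.STRING, Types.INT))
               // keyBy inserts a hash partitioner: records are redistributed over the
               // network so that all tuples with the same word reach the same subtask
               .keyBy(t -> t.f0)
               // sum runs per key on whichever subtask received that key's records
               .sum(1)
               .print();

            env.execute("WordCount shuffle sketch");
        }
    }
    ```

    If you replaced the keyBy/sum pair with, say, a filter followed by a sink, every connection in the job would be a local forward (or operator chain) and no data would cross the network between subtasks.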