hadoop, hadoop-streaming, hadoop-partitioning

How do Mapper and Reducer work together "without" sorting?


I know how MapReduce works and what steps it involves:

  • Mapping
  • Shuffle and sorting
  • Reducing

Of course, there are also partitioning and combiners, but those are not important right now.

The interesting thing is that when I run MapReduce jobs, it looks like mappers and reducers work in parallel.

So I don't understand how it is possible.

Question 1. If multiple nodes are still doing the mapping, how can a reducer start working? A reducer can't start working without sorted input, right? (The input to the reducer must be sorted, and while the mappers are still working, the input can't be fully sorted.)

Question 2. If I have multiple reducers, how is the final data merged together? In other words, the final results should be sorted, right? Does that mean we spend an additional O(n log n) time merging the results of the multiple reducers?


Solution

  • Reducers can start copying results from mappers as soon as those results become available. This is called the copy phase of the reduce task (see Hadoop: The Definitive Guide, Chapter 7, "How MapReduce Works").
    Also from there:

    ...When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This is done in rounds...
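
The quoted behaviour can be illustrated with a small simulation. This is only a sketch in plain Python, not Hadoop code: `map_task`, `reduce_task`, and the two input splits are hypothetical names made up for the example, and it assumes a single reducer with no partitioning or combiner. Each map task sorts its own output, and the reduce side only merges the already-sorted runs instead of sorting everything again.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def map_task(lines):
    """Emit (word, 1) pairs and sort them by key, as the map side does."""
    pairs = [(word, 1) for line in lines for word in line.split()]
    return sorted(pairs, key=itemgetter(0))


def reduce_task(sorted_map_outputs):
    """Merge already-sorted map outputs, then sum the counts for each key."""
    # heapq.merge performs a k-way merge that preserves the sort order;
    # it does not re-sort the combined data.
    merged = heapq.merge(*sorted_map_outputs, key=itemgetter(0))
    return [(key, sum(count for _, count in group))
            for key, group in groupby(merged, key=itemgetter(0))]


# Two hypothetical input splits, each handled by its own map task.
split_a = ["b a", "c a"]
split_b = ["a c", "d"]

# In a real job the reducer would copy each output as soon as its map
# task finishes; here both outputs are simply collected in a list.
map_outputs = [map_task(split_a), map_task(split_b)]
result = reduce_task(map_outputs)
print(result)  # [('a', 3), ('b', 1), ('c', 2), ('d', 1)]
```

Because every input run is already sorted, merging k runs with n records in total takes about O(n log k) comparisons, which is cheaper than sorting all n records from scratch on the reduce side.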