Tags: hadoop, cloud, elastic-map-reduce

Mapper and Reducer in Hadoop


I am confused about one aspect of how Hadoop is implemented.

I notice that when I run a Hadoop MapReduce job with multiple mappers and reducers, I get many part-xxxxx output files, yet each key appears in only one of them.

So I am wondering: how does MapReduce guarantee that all records with a given key end up in a single output file?

Thanks in advance.


Solution

  • The shuffle step in the MapReduce process is responsible for ensuring that all records with the same key are routed to the same reduce task. Since each reduce task writes exactly one part-xxxxx file, a given key can only appear in one output file. See this Yahoo tutorial for a description of the MapReduce data flow; the section called Partition & Shuffle states that

    Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is its origin.
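To make the routing concrete, here is a minimal sketch (my own illustration, not code from the question) of the logic used by Hadoop's default HashPartitioner: the key's hash, with the sign bit masked off, taken modulo the number of reduce tasks. Because the partition number depends only on the key and the reducer count, every mapper sends a given key to the same reducer, and therefore to the same part-xxxxx file.

```java
// Sketch of Hadoop's default HashPartitioner logic, with no Hadoop
// dependencies. PartitionDemo is a hypothetical class name.
public class PartitionDemo {

    // Mirrors HashPartitioner.getPartition: mask off the sign bit so the
    // result is non-negative, then take it modulo the number of reducers.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;

        // The same key always maps to the same partition, regardless of
        // which mapper emitted it...
        if (getPartition("apple", reducers) != getPartition("apple", reducers)) {
            throw new AssertionError("same key must map to same partition");
        }

        // ...and every partition index is a valid reducer number.
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            int p = getPartition(key, reducers);
            if (p < 0 || p >= reducers) {
                throw new AssertionError("partition out of range");
            }
            System.out.println(key + " -> partition " + p);
        }
    }
}
```

A job can override this behavior by supplying a custom Partitioner, but as long as the partition function is deterministic in the key, the "one key, one output file" property holds.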