
How is the file with intermediate values partitioned on a map worker in MapReduce?


I'm trying to understand the MapReduce model and need some advice, because I'm not sure how the file holding the intermediate results of the map function is sorted and partitioned. Most of what I know about MapReduce comes from the MapReduce papers by Jeffrey Dean & Sanjay Ghemawat and from Hadoop: The Definitive Guide.

The file with the intermediate results of the map function is composed of small sorted and partitioned files. These small files are divided into partitions corresponding to the reduce workers, and are then merged into one file. I need to know how the small files are partitioned. At first I thought that every partition covers some range of keys.

For example: if the keys are integers in the range <1;100> and the file is divided into three partitions, then the first partition could consist of values with keys in the range <1;33>, the second of keys in the range <34;66>, and the third of keys in the range <67;100>. The merged file has the same partitioning.

But I'm not sure about this. Every partition is sent to its corresponding reduce worker. In our example, if we have two reduce workers, then the partitions with the first two key ranges (<1;33> and <34;66>) could be sent to the first worker and the last partition to the second worker. But if I'm wrong and the files are divided some other way (I mean that partitions don't have their own range of possible keys), then every reduce worker could have results for the same keys. So I would need to merge the results of these reduce workers somehow, right? Can I send these results to the master node and merge them there?

In short: I need an explanation of how the files are divided in the map phase (if my description is wrong), and of how and where I can process the results of the reduce workers.

I hope I've described my problem clearly enough. I can explain it in more detail, of course.

Thanks a lot for your answers.


Solution

  • Partitioning is done by a Partitioner class. Each intermediate key/value pair is passed to the partitioner along with the total number of reducers (partitions), and the partitioner returns the number of the partition that should handle that pair.

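    Hadoop's default partitioner (HashPartitioner) does an OK job: it derives the partition from the key's hash code modulo the number of reducers. The partitions are therefore not key ranges, but all pairs with the same key still go to the same reducer, so no extra merge across reducers is needed. If you want more control, or you have a specially formatted (e.g. complex) key, then you can and should write your own partitioner.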
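
To make the two partitioning schemes concrete, here is a minimal plain-Java sketch that does not depend on Hadoop. The class name `PartitionSketch` and both method names are made up for illustration. `hashPartition` mirrors the arithmetic used by Hadoop's `HashPartitioner`, while `rangePartition` is a hypothetical range-based scheme using the split points from the question (33 and 66). In a real job this logic would live in a subclass of `org.apache.hadoop.mapreduce.Partitioner`, inside its `getPartition(key, value, numPartitions)` method.

```java
// Plain-Java sketch of two partitioning schemes; no Hadoop classes are used.
// All names here are illustrative assumptions, not part of the Hadoop API.
final class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner:
    // clear the sign bit of the hash code, then take it modulo the reducer count.
    // Equal keys always produce the same partition number.
    static int hashPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical range partitioner: upperBounds holds the inclusive upper
    // key of every partition except the last, in ascending order.
    // With upperBounds = {33, 66} the partitions are <=33, 34..66 and >=67.
    static int rangePartition(int key, int[] upperBounds) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (key <= upperBounds[i]) {
                return i;
            }
        }
        return upperBounds.length; // key is above every split point
    }

    public static void main(String[] args) {
        int reducers = 3;
        int[] upperBounds = {33, 66};
        for (int key : new int[] {1, 33, 34, 66, 67, 100}) {
            System.out.println("key " + key
                    + " -> hash partition " + hashPartition(key, reducers)
                    + ", range partition " + rangePartition(key, upperBounds));
        }
    }
}
```

Either way, the partition number is a pure function of the key, so every map worker sends a given key to the same reducer. The range variant additionally keeps the reducers' outputs globally ordered by key, which is the usual reason to write a custom partitioner of this kind.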