Let's consider the basic word count example for a MapReduce job and the following input:
word1
word2
word1
word2
word3
For our processing, assume that we have three mappers and three reducers.
During the map phase, the data is processed as follows:
MAP1: (word1,1), (word2,1)
MAP2: (word1,1), (word2,1)
MAP3: (word3,1)
Now, the shuffle phase starts: all word1 keys need to end up together, as do the word2 and word3 keys.
The shuffle phase could decide to send word1 to reducer1, word2 to reducer2, and word3 to reducer3; or it could send word1 to reducer2, etc.
How is it decided to which reducer each key will be sent?
Before the reduce step, Hadoop uses an implementation of Partitioner to determine to which reducer each key should be sent.
By default this is HashPartitioner, whose partitioning method is:
public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then
    // take the hash modulo the number of reducers
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
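As an illustration, here is a small standalone Java sketch (plain strings rather than Hadoop's Writable types, so it runs without Hadoop on the classpath) applying the same formula to the keys from the example above with three reducers:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionDemo {
    // Same formula as HashPartitioner.getPartition, applied to a String key
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        Map<String, Integer> assignment = new LinkedHashMap<>();
        for (String key : new String[] {"word1", "word2", "word1", "word2", "word3"}) {
            assignment.put(key, partition(key, numReduceTasks));
        }
        // Every occurrence of the same key lands on the same reducer,
        // which is what makes the reduce-side grouping correct
        assignment.forEach((k, p) -> System.out.println(k + " -> reducer" + p));
    }
}
```

Because the assignment depends only on the key's hash, the two (word1,1) pairs produced by MAP1 and MAP2 are guaranteed to reach the same reducer.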
You can plug in a custom implementation if your job requires additional logic:
job.setPartitionerClass(YourPartitioner.class);
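For example, a custom partitioner might pin a known hot key to its own reducer. The routing rule below is made up for illustration; the logic is shown as a plain static method so the sketch runs standalone, but in a real job it would live in a class extending org.apache.hadoop.mapreduce.Partitioner<Text, IntWritable>:

```java
public class CustomPartitionerSketch {
    // Hypothetical rule: pin the hot key "word1" to reducer 0 and
    // hash all other keys over reducers 1..numReduceTasks-1
    static int getPartition(String key, int numReduceTasks) {
        if (numReduceTasks <= 1 || key.equals("word1")) {
            return 0;
        }
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"word1", "word2", "word3"}) {
            System.out.println(key + " -> reducer" + getPartition(key, 3));
        }
    }
}
```

A partitioner like this can help when one key dominates the input and would otherwise overload whichever reducer the default hash happens to pick alongside other keys.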