Let's consider the basic word count example for a MapReduce job and the following input:
word1
word2
word1
word2
word3
For our processing, assume that we have three mappers and three reducers.
During the map phase, the data is processed as follows:
MAP1: (word1,1), (word2,1)
MAP2: (word1,1), (word2,1)
MAP3: (word3,1)
Now, the shuffle phase starts: all word1 keys need to end up together, as do the word2 and word3 keys.
The shuffle phase could decide to send word1 to reducer1, word2 to reducer2, and word3 to reducer3; or it could send word1 to reducer2, etc.
How is it decided to which reducer each key will be sent?
Before the reduce step, Hadoop uses an implementation of Partitioner to determine to which reducer each key should be sent.
By default this is HashPartitioner, whose partitioning method is:
public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then
    // take the hash modulo the number of reducers
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
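As an illustration, here is a small standalone Java sketch (plain strings rather than Hadoop's Writable types, so it runs without Hadoop on the classpath) applying the same formula to the keys from the example above with three reducers:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionDemo {
    // Same formula as HashPartitioner.getPartition, applied to a String key
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        Map<String, Integer> assignment = new LinkedHashMap<>();
        for (String key : new String[] {"word1", "word2", "word1", "word2", "word3"}) {
            assignment.put(key, partition(key, numReduceTasks));
        }
        // Every occurrence of the same key lands on the same reducer,
        // which is what makes the reduce-side grouping correct
        assignment.forEach((k, p) -> System.out.println(k + " -> reducer" + p));
    }
}
```

Because the assignment depends only on the key's hash, the two (word1,1) pairs produced by MAP1 and MAP2 are guaranteed to reach the same reducer.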
You can plug in a custom implementation if your job requires additional logic:
job.setPartitionerClass(YourPartitioner.class);
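For example, a custom partitioner might pin a known hot key to its own reducer. The routing rule below is made up for illustration; the logic is shown as a plain static method so the sketch runs standalone, but in a real job it would live in a class extending org.apache.hadoop.mapreduce.Partitioner<Text, IntWritable>:

```java
public class CustomPartitionerSketch {
    // Hypothetical rule: pin the hot key "word1" to reducer 0 and
    // hash all other keys over reducers 1..numReduceTasks-1
    static int getPartition(String key, int numReduceTasks) {
        if (numReduceTasks <= 1 || key.equals("word1")) {
            return 0;
        }
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"word1", "word2", "word3"}) {
            System.out.println(key + " -> reducer" + getPartition(key, 3));
        }
    }
}
```

A partitioner like this can help when one key dominates the input and would otherwise overload whichever reducer the default hash happens to pick alongside other keys.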