I am learning Hadoop MapReduce and working with the Java API. I learned about the TotalOrderPartitioner, which is used to sort the output 'globally' by key across the cluster, and that it needs a partition file (generated using InputSampler):
job.setPartitionerClass(TotalOrderPartitioner.class);
// sample 10% of the input records, collecting at most 200 keys, to pick the split points
InputSampler.Sampler<Text, Text> sampler = new InputSampler.RandomSampler<Text, Text>(0.1, 200);
InputSampler.writePartitionFile(job, sampler);
I have a couple of questions and would appreciate help from the community:
What exactly does 'sorted globally' mean here? How is the output sorted when we still have multiple output part files distributed across the cluster?
What happens if we do not supply the partition file? Is there a default way to handle this situation?
Let's explain it with an example. Say your partition file looks like this:
H
T
V
Assuming your keys range from A to Z, this defines 4 ranges (a lookup sketch follows the list):
1 [A,H)
2 [H,T)
3 [T,V)
4 [V,Z]
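To make the lookup concrete, here is a minimal sketch in plain Java of how the split points translate into partition numbers. It is simplified: the real TotalOrderPartitioner uses the job's RawComparator (and optionally a trie) rather than natural String ordering:

import java.util.Arrays;

public class SplitPointLookup {

    // the three keys from the partition file above
    static final String[] SPLIT_POINTS = {"H", "T", "V"};

    // returns the 0-based partition for a key: keys below "H" go to 0,
    // [H,T) to 1, [T,V) to 2 and everything from "V" on to 3
    static int partitionFor(String key) {
        int pos = Arrays.binarySearch(SPLIT_POINTS, key);
        // a miss returns -(insertionPoint) - 1; an exact hit at index i
        // belongs to partition i + 1 because ranges are closed on the left
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        for (String key : "A,N,C,K,Z,S,U".split(",")) {
            System.out.println(key + " -> reducer " + (partitionFor(key) + 1));
        }
    }
}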
When a mapper now emits a record, the partitioner looks at the output key. Let's say the keys emitted by all mappers combined are:
A,N,C,K,Z,S,U
Now the partitioner checks your partition file and sends each record to the corresponding reducer. Let us assume you have defined 4 reducers, so each reducer will handle one range (a complete driver sketch follows the list):
Reducer 1 handles A,C
Reducer 2 handles N,K,S
Reducer 3 handles U
Reducer 4 handles Z
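Putting this together with the snippet from the question, a complete driver might look roughly as follows. This is a sketch under a few assumptions: the input is readable with KeyValueTextInputFormat (so the sampler sees Text keys), and /tmp/partitions.lst is just a placeholder location for the partition file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "total order sort");
        job.setJarByClass(TotalSortDriver.class);

        // the default identity mapper/reducer are enough for a pure sort
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 4 reducers, so the partition file must contain exactly 3 keys
        job.setNumReduceTasks(4);
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // tell the partitioner where the partition file lives, then let the
        // sampler write it there before the job is submitted
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/partitions.lst")); // placeholder path
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<Text, Text>(0.1, 200);
        InputSampler.writePartitionFile(job, sampler);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}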
Since reducer 1 only receives keys smaller than those sent to reducer 2, and so on, each output part file is not only sorted internally but also sorted relative to the others: reading part-r-00000 through part-r-00003 in order gives one globally sorted sequence. That is what 'sorted globally' means. It also follows that your partition file must hold exactly n - 1 keys, where n is the number of reducers you are using. Another important note from the docs:
If the keytype is BinaryComparable and total.order.partitioner.natural.order is not false, a trie of the first total.order.partitioner.max.trie.depth(2) + 1 bytes will be built. Otherwise, keys will be located using a binary search of the partition keyset using the RawComparator defined for this job. The input file must be sorted with the same comparator and contain JobContextImpl.getNumReduceTasks() - 1 keys.
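For completeness: the partition file is just a SequenceFile whose keys are the split points (InputSampler writes NullWritable values). The three-key file from the example could therefore also be written by hand, as in this sketch (the path /tmp/partitions.lst is again a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WritePartitionFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/tmp/partitions.lst")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(NullWritable.class));
        try {
            // keys must already be sorted with the job's comparator and
            // number exactly numReduceTasks - 1 (here: 4 - 1 = 3)
            for (String split : new String[] {"H", "T", "V"}) {
                writer.append(new Text(split), NullWritable.get());
            }
        } finally {
            writer.close();
        }
    }
}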