Suppose the key distribution in a file is heavily skewed: 99% of the words start with 'A' and only 1% start with 'B' through 'Z'. If you have to count the number of words starting with each letter, how would you distribute your keys efficiently?
SOLUTION 1: I think the way to go is a combiner, rather than a partitioner. A combiner will aggregate the local counts of words starting with the letter 'A' on each map task and emit the partial sum to the reducers, instead of emitting a 1 for every word.
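A minimal sketch of this idea using the Hadoop `org.apache.hadoop.mapreduce` API (the class names `FirstLetterMapper` and `SumReducer` are illustrative, not from the original question). Because summing is associative and commutative, the same reducer class can double as the combiner:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (first letter, 1) for every word.
public class FirstLetterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text letter = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                letter.set(word.substring(0, 1).toUpperCase());
                context.write(letter, ONE);
            }
        }
    }
}

// Reducer: sum the counts per letter; reused as the combiner to pre-aggregate
// the huge pile of ('A', 1) pairs locally before the shuffle.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// In the driver:
//   job.setCombinerClass(SumReducer.class);  // partial sums computed on the map side
//   job.setReducerClass(SumReducer.class);
```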
SOLUTION 2: However, if you insist on using a custom partitioner for this, you can route words starting with the letter 'A' to a reducer of their own, separate from all other words, i.e., dedicate one reducer exclusively to words starting with 'A'.
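A sketch of such a partitioner (the class name `LetterPartitioner` and the choice of four reduce tasks are assumptions for illustration):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitioner: keys starting with 'A' get partition 0 (their own reducer);
// all other keys are spread over the remaining reducers by hash.
public class LetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;
        }
        String k = key.toString();
        if (!k.isEmpty() && Character.toUpperCase(k.charAt(0)) == 'A') {
            return 0;                         // dedicated reducer for 'A'
        }
        // reducers 1 .. numPartitions-1 share everything else
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

// In the driver:
//   job.setPartitionerClass(LetterPartitioner.class);
//   job.setNumReduceTasks(4);  // e.g. one reducer for 'A', three for 'B'..'Z'
```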
SOLUTION 3: Moreover, if you don't mind "cheating" a little bit, you can define a counter for words starting with the letter 'A' and increment it in the map phase. Then, just ignore those words (there is no need to send them through the network) and use the default partitioner for the other words. When the job finishes, retrieve the value of the counter.
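A possible sketch of that mapper, assuming the Hadoop job-counter mechanism; the class name `SkewAwareMapper` and the counter enum `WordCounters.STARTS_WITH_A` are hypothetical:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: 'A' words are counted with a job counter and never emitted,
// so they never cross the network; all other words take the normal path.
public class SkewAwareMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public enum WordCounters { STARTS_WITH_A }   // hypothetical counter name

    private static final IntWritable ONE = new IntWritable(1);
    private final Text letter = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            char c = Character.toUpperCase(word.charAt(0));
            if (c == 'A') {
                context.getCounter(WordCounters.STARTS_WITH_A).increment(1);
            } else {
                letter.set(String.valueOf(c));
                context.write(letter, ONE);   // 'B'..'Z' still go through the reducers
            }
        }
    }
}

// In the driver, after job.waitForCompletion(true):
//   long aCount = job.getCounters()
//       .findCounter(SkewAwareMapper.WordCounters.STARTS_WITH_A).getValue();
```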
SOLUTION 4: If you don't mind "cheating" even more, define 26 counters, one for each letter, and just increment them in the map phase according to the first letter of the current word. You can use no reducers at all (set the number of reducers to 0), which saves all the sorting and shuffling. When the job finishes, retrieve the values of all the counters.
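A sketch of this map-only variant; the class name `CounterOnlyMapper` and the counter group name `"FIRST_LETTER"` are assumptions for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: one counter per letter, nothing emitted, no shuffle or sort.
public class CounterOnlyMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    private static final String GROUP = "FIRST_LETTER";   // hypothetical counter group

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                char c = Character.toUpperCase(word.charAt(0));
                if (c >= 'A' && c <= 'Z') {
                    context.getCounter(GROUP, String.valueOf(c)).increment(1);
                }
            }
        }
    }
}

// In the driver:
//   job.setNumReduceTasks(0);               // map-only: no sort, no shuffle
//   job.waitForCompletion(true);
//   for (char c = 'A'; c <= 'Z'; c++) {
//       long n = job.getCounters()
//           .findCounter("FIRST_LETTER", String.valueOf(c)).getValue();
//       System.out.println(c + ": " + n);
//   }
```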