I understand that to include a combiner in Hadoop MapReduce, the following line is added (which I have done already):
conf.setCombinerClass(MyReducer.class);
What I don't understand is where I actually implement the functionality of the combiner. Do I create a combine() method under MyReducer, similar to the reduce method?
public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { }
Many thanks in advance!
A Combiner should simply be a Reducer, and thus implement the Reducer interface (there is no Combiner interface). Think of the combining step as an intermediary reducing step between the Mapper and the Reducer.
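In concrete terms, with the old org.apache.hadoop.mapred API that your snippet uses, you register the *same* class for both steps and write no extra method at all. A hedged sketch (the class name MyReducer comes from your question; the summing logic is the usual word-count assumption):

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// One class, two roles: the framework runs this same reduce()
// for the map-side combine step and for the final reduce step.
public class MyReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

// In the job setup, register it for both steps:
// conf.setCombinerClass(MyReducer.class);
// conf.setReducerClass(MyReducer.class);
```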
Take the Word Count example. From Yahoo's tutorial:
Word count is a prime example for where a Combiner is useful. The Word Count program in listings 1--3 emits a (word, 1) pair for every instance of every word it sees. So if the same document contains the word "cat" 3 times, the pair ("cat", 1) is emitted three times; all of these are then sent to the Reducer. By using a Combiner, these can be condensed into a single ("cat", 3) pair to be sent to the Reducer. Now each node only sends a single value to the reducer for each word -- drastically reducing the total bandwidth required for the shuffle process, and speeding up the job. The best part of all is that we do not need to write any additional code to take advantage of this! If a reduce function is both commutative and associative, then it can be used as a Combiner as well.
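To see why a summing reduce function works unchanged as a combiner, here is a small plain-Java simulation (no Hadoop needed; the helper names are made up for illustration). It applies the same summing function once to a node's map output (the combine step) and again to the condensed pairs (the reduce step), and gets the same counts as reducing the raw pairs directly, because summing is commutative and associative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerDemo {
    // The "reduce" function: sum all counts for one key.
    static int sum(List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    // Group (word, count) pairs by word, then apply sum() per word.
    static Map<String, Integer> reduceAll(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        Map<String, Integer> out = new HashMap<>();
        grouped.forEach((k, v) -> out.put(k, sum(v)));
        return out;
    }

    public static void main(String[] args) {
        // Map output on one node: "cat" appears three times.
        List<Map.Entry<String, Integer>> mapped = List.of(
                Map.entry("cat", 1), Map.entry("cat", 1), Map.entry("cat", 1),
                Map.entry("dog", 1));

        // Without a combiner: all four pairs cross the network.
        Map<String, Integer> direct = reduceAll(mapped);

        // With a combiner: condense locally first, then reduce the
        // condensed pairs -- only one pair per word is sent.
        List<Map.Entry<String, Integer>> combined =
                new ArrayList<>(reduceAll(mapped).entrySet());
        Map<String, Integer> viaCombiner = reduceAll(combined);

        System.out.println(direct);
        System.out.println(viaCombiner);
        System.out.println(direct.equals(viaCombiner)); // prints true
    }
}
```

The equality at the end is exactly the "commutative and associative" condition the tutorial mentions: because order and grouping of the sums don't matter, the combine step can pre-aggregate without changing the final answer.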
Hope that helps.