I am a newbie to MapReduce and I just can't figure out the difference between the partitioner and the combiner. I know both run in the intermediate step between the map and reduce tasks, and I believe both reduce the amount of data to be processed by the reduce task. Please explain the difference using an example.
I think a little example can explain this very clearly and quickly.
Let's say you have a MapReduce Word Count job with 2 mappers and 1 reducer.
Without a combiner:
"hello hello there"
=> mapper1 => (hello, 1), (hello, 1), (there, 1)
"howdy howdy again"
=> mapper2 => (howdy, 1), (howdy, 1), (again, 1)
Both outputs reach the reducer => (again, 1), (hello, 2), (howdy, 2), (there, 1)
With a combiner:
"hello hello there"
=> mapper1 with combiner => (hello, 2), (there, 1)
"howdy howdy again"
=> mapper2 with combiner => (howdy, 2), (again, 1)
Both outputs reach the reducer => (again, 1), (hello, 2), (howdy, 2), (there, 1)
The end result is the same, but with a combiner the map output is already partially aggregated before it leaves the mapper. In this example each mapper sends only 2 pairs to the reducer instead of 3, so less data has to be written to disk and shuffled over the network. This is useful when aggregating values.
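To make the numbers above concrete, here is a rough plain-Java simulation of the same word count. It does not use Hadoop at all; the class and method names (`CombinerDemo`, `map`, `sumByKey`) are made up for illustration, and the "shuffle" is just list concatenation. Treat it as a sketch of the idea, not of how the framework is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of the example; not Hadoop code.
public class CombinerDemo {

    // Map step: emit (word, 1) for every non-empty word in the line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Sum the values per key. In this sketch the same logic plays the role
    // of both the combiner (per mapper) and the reducer (over everything).
    public static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapper1 = map("hello hello there");
        List<Map.Entry<String, Integer>> mapper2 = map("howdy howdy again");

        // Without a combiner: every raw pair is sent to the reducer.
        List<Map.Entry<String, Integer>> raw = new ArrayList<>(mapper1);
        raw.addAll(mapper2);
        System.out.println("pairs sent without combiner: " + raw.size()); // 6
        System.out.println(sumByKey(raw));

        // With a combiner: each mapper's output is summed locally first.
        List<Map.Entry<String, Integer>> combined =
                new ArrayList<>(sumByKey(mapper1).entrySet());
        combined.addAll(sumByKey(mapper2).entrySet());
        System.out.println("pairs sent with combiner: " + combined.size()); // 4
        System.out.println(sumByKey(combined));
    }
}
```

Both runs print `{again=1, hello=2, howdy=2, there=1}` as the final result; only the number of pairs handed to the reducer differs (6 versus 4).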
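The combiner is essentially a reducer applied locally to each mapper's output. The partitioner, by contrast, does not shrink the data; it only decides which reducer receives each key, so with a single reducer it has no visible effect in this example.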
If you take a look at the very first Apache Hadoop MapReduce tutorial, which happens to be exactly the MapReduce example I just illustrated, you can see they use the reducer as the combiner:
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
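Reusing the reducer as the combiner works for word count because summing is associative and commutative, so it does not matter whether partial sums are computed first. That is not true of every reduce function. Here is a small plain-Java sketch (again no Hadoop; `CombinerSafety`, `sum` and `mean` are hypothetical names, and the input values are invented) showing that sums survive a pre-aggregation step while a naive mean does not:

```java
import java.util.List;

// Hypothetical illustration; not Hadoop code.
public class CombinerSafety {

    public static double sum(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    public static double mean(List<Double> values) {
        return sum(values) / values.size();
    }

    public static void main(String[] args) {
        // Made-up values for one key, split across two mappers.
        List<Double> mapper1 = List.of(1.0, 2.0);
        List<Double> mapper2 = List.of(6.0);
        List<Double> all = List.of(1.0, 2.0, 6.0);

        // Sum of partial sums equals the sum over everything: 9.0 both ways.
        System.out.println(sum(List.of(sum(mapper1), sum(mapper2))) == sum(all)); // true

        // Mean of partial means (3.75) differs from the true mean (3.0).
        System.out.println(mean(List.of(mean(mapper1), mean(mapper2))) == mean(all)); // false
    }
}
```

So a reducer that computes an average could not simply be passed to `setCombinerClass` as-is; the usual workaround is to have the combiner emit partial sums and counts and let the reducer do the final division.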