The "MapReduce Design Patterns" book has a pattern for finding distinct records in a dataset. This is the algorithm:
map(key, record):
    emit record, null
reduce(key, records):
    emit key
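To make the pseudocode concrete, here is a minimal stand-alone simulation in plain Python (not the Hadoop API; the function names are illustrative only):

```python
# Toy simulation of the distinct pattern: the record itself becomes the
# key, the value is null (Hadoop's NullWritable), and the shuffle groups
# identical records so the reducer emits each key exactly once.

def map_phase(records):
    # emit (record, null) for every input record
    return [(record, None) for record in records]

def shuffle(pairs):
    # group pairs by key, as the framework does between map and reduce
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups

def reduce_phase(groups):
    # emit each key once, ignoring the (null) values
    return [key for key in groups]

records = ["John", "Adam", "John", "John"]
distinct = reduce_phase(shuffle(map_phase(records)))
print(sorted(distinct))  # ['Adam', 'John']
```

The grouping done by the shuffle is what makes the reducer trivial: all duplicates of a record arrive at the same reduce call, so emitting the key once suffices.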
On page 66 it says:
The Combiner can always be utilized in this pattern and can help if there are a large number of duplicates.
The map phase emits the record and a NullWritable
(which is not written on the wire). What does the Combiner
try to reduce? There is no value to reduce.
It tries to reduce the number of duplicates in the map output.
Let's say you have text data with one word on each line:
John
Adam
John
John
There is no point in sending every "John"
to the reducer if you can combine them after the map phase and send only:
John
Adam
This output is already distinct for each mapper, which saves bandwidth if you have a fair amount of non-distinct records in your split.
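To see the saving in numbers, here is a hedged stand-alone sketch in plain Python (not the Hadoop API; names are illustrative). For this pattern the combiner is just the reducer logic run locally on one mapper's output, collapsing duplicates before anything crosses the network:

```python
# Toy sketch: a combiner for the distinct pattern deduplicates a single
# mapper's (record, null) pairs, so only one pair per distinct record
# is shipped to the reducers.

def map_phase(records):
    # emit (record, null) for every input record
    return [(record, None) for record in records]

def combine(pairs):
    # deduplicate this mapper's output, preserving first-seen order
    seen = set()
    out = []
    for key, value in pairs:
        if key not in seen:
            seen.add(key)
            out.append((key, value))
    return out

split = ["John", "Adam", "John", "John"]
without_combiner = map_phase(split)
with_combiner = combine(without_combiner)

print(len(without_combiner))          # 4 pairs shipped without a combiner
print(len(with_combiner))             # 2 pairs shipped with a combiner
print([k for k, _ in with_combiner])  # ['John', 'Adam']
```

In real Hadoop you would register the same reducer class as the combiner; the sketch above only illustrates why that cuts the shuffle volume when a split contains many duplicates.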