java · hadoop · mapreduce · distinct-values

Selecting distinct records in Hadoop and using combiner


"MapReduce Design Patterns" book has pattern for finding distinct records in dataset. This is the algorithm:

map(key, record):
    emit record, null

reduce(key, records):
    emit key
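
In Hadoop terms, that pseudocode translates roughly into the following Mapper and Reducer (a minimal sketch, assuming text input; the class names are illustrative, not from the book):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctPattern {

    // map(key, record): emit record, null
    public static class DistinctMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The whole record becomes the output key; the value carries no data.
            context.write(value, NullWritable.get());
        }
    }

    // reduce(key, records): emit key
    public static class DistinctReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // All duplicates share the same key, so each distinct record is written once.
            context.write(key, NullWritable.get());
        }
    }
}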

On page 66 it says:

The Combiner can always be utilized in this pattern and can help if there are a large number of duplicates.

The map phase emits the record and a NullWritable (which is not written to the wire). What does the Combiner try to reduce? There are no values to reduce.


Solution

  • It tries to reduce the duplicates in a map output.

    Let's say you have text data of words in every line:

    John
    Adam
    John
    John
    

    There is no point in sending every John to the reducer if you can combine them after the map phase and only send:

    John
    Adam
    

    This output is already distinct for each mapper, which saves bandwidth if you have a fair number of non-distinct records in your split (see the driver sketch below for wiring this up).
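
Because the reducer only emits the key, it can be reused as the combiner. A hypothetical driver (class names follow the sketch in the question; input and output paths are placeholder arguments) might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistinctDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distinct");
        job.setJarByClass(DistinctDriver.class);

        job.setMapperClass(DistinctPattern.DistinctMapper.class);
        // The combiner deduplicates each map task's output locally, so repeated
        // records like "John" are sent over the network to the reducer only once per mapper.
        job.setCombinerClass(DistinctPattern.DistinctReducer.class);
        job.setReducerClass(DistinctPattern.DistinctReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer as the combiner works here because its input and output types are identical (Text key, NullWritable value), which is exactly what a combiner requires.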