java hadoop mapreduce combiners

Hadoop Combiner Class for Text


I'm still trying to get an intuition as to when to use the Hadoop combiner class (I saw a few articles but they did not specifically help in my situation).

My question is, is it appropriate to use a combiner class when the value of the pair is of the Text class? For instance, let's say we have the following output from the mapper:

fruit apple
fruit orange
fruit banana
...
veggie carrot
veggie celery
...

Can we apply a combiner class here so that the output becomes:

fruit apple orange banana
...
veggie carrot celery
...

before it even reaches the reducer?


Solution

  • Combiners are typically suited to problems where you perform some form of aggregation (min, max, etc.) on the data. These values can be calculated in the combiner for each mapper's output, and then calculated again in the reducer over all the combined outputs. This is useful because it means you are not transferring all the data across the network between the mappers and the reducer.
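
    To make the "calculate in the combiner, then again in the reducer" idea concrete, here is a minimal sketch in plain Java. It deliberately uses no Hadoop classes: `MaxCombineSketch` and its `max` method are hypothetical stand-ins for a combiner and reducer that share the same logic, and the numbers are made up.

    ```java
    import java.util.List;

    /** Hypothetical stand-in for a max-style combiner/reducer; plain Java, no Hadoop classes. */
    final class MaxCombineSketch {
        // The same function can serve as combiner and reducer because max is
        // associative and commutative: max(max(a, b), max(c)) == max(a, b, c).
        static int max(List<Integer> values) {
            int best = Integer.MIN_VALUE;
            for (int v : values) {
                best = Math.max(best, v);
            }
            return best;
        }

        public static void main(String[] args) {
            // Values for one key, split across two mappers (made-up numbers).
            int partialA = max(List.of(3, 9, 4)); // combiner step on mapper A
            int partialB = max(List.of(7, 2));    // combiner step on mapper B
            // The reducer sees only the two partial results, not all five values.
            System.out.println(max(List.of(partialA, partialB)));
        }
    }
    ```

    Only one value per key leaves each mapper here, which is where the network saving comes from.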

    Now there is no reason that you can't introduce a combiner to accumulate a list of the values observed for each key (I assume this is what your example shows), but there are some things which would make it trickier.

    If you have to output <Text, Text> pairs from the mapper and consume <Text, Text> in the reducer, then your combiner can easily concatenate the list of values together and output this as a single Text value. In your reducer you can do the same: concatenate all the values together and form one big output.
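
    The concatenation step can be sketched as follows. This is plain Java modelling only the logic, with `String` standing in for `Text`; `ConcatCombineSketch` and `concat` are hypothetical names, and the space separator is an assumption taken from the example in the question. In a real job the same logic would sit in the `reduce` method of a `Reducer<Text, Text, Text, Text>` subclass registered with `job.setCombinerClass(...)` and `job.setReducerClass(...)`.

    ```java
    import java.util.List;

    /** Hypothetical sketch of a concatenating combiner/reducer; String stands in for Text. */
    final class ConcatCombineSketch {
        // Joins every value seen for one key into a single space-separated string.
        // Because the output is again a single string value, the reducer can apply
        // the very same function to the strings produced by the combiners.
        static String concat(List<String> values) {
            return String.join(" ", values);
        }

        public static void main(String[] args) {
            // Values for the key "fruit", split across two mappers.
            String fromMapperA = concat(List.of("apple", "orange")); // combiner output
            String fromMapperB = concat(List.of("banana"));          // combiner output
            // Reducer input is the two combined strings; output is one big list.
            System.out.println("fruit " + concat(List.of(fromMapperA, fromMapperB)));
        }
    }
    ```

    Hadoop may run a combiner zero, one, or several times for a given key, so the reducer has to give the right answer whether it receives raw mapper values, already-combined values, or a mix of both. Plain concatenation has that property, though the order of words in the final list is not something to rely on.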

    You may run into a problem if you want to sort and dedup the output list, as the combiner / reducer logic would need to tokenize the Text object back into words, sort and dedup the list, and then rebuild the list of words.
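
    A rough sketch of that tokenize, sort, dedup, and rebuild cycle is below. It is plain Java with `String` standing in for `Text`; `SortDedupSketch` and `sortDedup` are hypothetical names, and it assumes individual values never contain whitespace, since whitespace is used as the separator.

    ```java
    import java.util.List;
    import java.util.TreeSet;

    /** Hypothetical sketch of a sort-and-dedup combiner/reducer; String stands in for Text. */
    final class SortDedupSketch {
        // Each incoming value may be a single word (straight from the mapper) or a
        // space-separated list (from an earlier combiner pass), so split every value.
        static String sortDedup(List<String> values) {
            TreeSet<String> words = new TreeSet<>(); // sorted, no duplicates
            for (String value : values) {
                for (String word : value.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
            // Rebuild the list of words as one space-separated string.
            return String.join(" ", words);
        }

        public static void main(String[] args) {
            String combined = sortDedup(List.of("orange", "apple", "apple"));
            System.out.println(sortDedup(List.of(combined, "banana apple")));
        }
    }
    ```

    The extra parsing on every pass is the cost mentioned above: the words are flattened into one string on the way out and split apart again on the way in.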

    To directly answer your question of when it would be appropriate, I can think of some examples:

    • If you want to find the lexicographically smallest or largest value associated with each key
    • If you have millions of values for each key and you want to 'randomly' sample a small set of the values
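
    Both cases in the list above can be sketched in a few lines of plain Java, again with `String` standing in for `Text`. `KeyPickSketch`, `smallest`, and `sample` are hypothetical names, and shuffling then truncating is just one simple way to pick a sample, not the only one.

    ```java
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    /** Hypothetical sketch of two Text-valued combiners; String stands in for Text. */
    final class KeyPickSketch {
        // Lexicographically smallest value for a key. Safe to apply repeatedly:
        // the minimum of per-mapper minimums equals the overall minimum.
        // Assumes at least one value, which holds for any key that reaches a reducer.
        static String smallest(List<String> values) {
            return Collections.min(values);
        }

        // Keeps at most k values chosen at random. Run in the combiner, this caps
        // how many values each mapper ships per key; the reducer samples again.
        static List<String> sample(List<String> values, int k, Random rng) {
            List<String> copy = new ArrayList<>(values);
            Collections.shuffle(copy, rng);
            return new ArrayList<>(copy.subList(0, Math.min(k, copy.size())));
        }

        public static void main(String[] args) {
            System.out.println(smallest(List.of("orange", "apple", "banana")));
            System.out.println(sample(List.of("a", "b", "c", "d"), 2, new Random()));
        }
    }
    ```

    Sampling in two stages like this is not uniform over all values when mappers hold very different numbers of values for a key, which matches the quotation marks around 'randomly' above.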