hadoop mapreduce combiners

Hadoop combiner sort phase


When running a MapReduce job with a specified combiner, is the combiner run during the sort phase? I understand that the combiner is run on mapper output for each spill, but it seems like it would also be beneficial to run during intermediate steps when merge sorting. I'm assuming here that in some stages of the sort, mapper output for some equivalent keys is held in memory at some point.

If this doesn't currently happen, is there a particular reason, or just something which hasn't been implemented?

Thanks in advance!


Solution

  • Combiners are there to save network bandwidth.

    The map output is sorted directly:

    sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
    

    This happens right after the actual mapping is done. While iterating through the buffer, it checks whether a combiner has been set and, if so, combines the records. If not, it spills directly to disk.

    The important parts are in MapTask, if you'd like to see for yourself.

        sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
        // some fields
        for (int i = 0; i < partitions; ++i) {
            // check if configured
            if (combinerRunner == null) {
                // spill directly
            } else {
                combinerRunner.combine(kvIter, combineCollector);
            }
        }
    

    This is the right stage to save disk space and network bandwidth, because the output very likely has to be transferred. During the merge/shuffle/sort phase it is not beneficial, because you would then have to crunch more data than when the combiner runs at map finish time.

    Note that the sort phase shown in the web interface is misleading: it is just pure merging.
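To make the sort-then-combine step described above concrete, here is a minimal, self-contained sketch in plain Java. It does not use Hadoop at all: `Rec`, `SpillCombineSketch`, `partitionFor`, and `sortAndCombine` are hypothetical names invented for illustration, and the "combiner" is assumed to be a simple sum, as in a word count. It only mimics the idea that one buffer's worth of map output is sorted by (partition, key) and that adjacent records with equal keys are collapsed before anything is written out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one buffered map output record; not a Hadoop class.
final class Rec {
    final int partition;
    final String key;
    final int value;

    Rec(int partition, String key, int value) {
        this.partition = partition;
        this.key = key;
        this.value = value;
    }
}

class SpillCombineSketch {

    // Assumed hash-based partitioning: equal keys always land in the same partition.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Sorts a copy of the buffer by (partition, key), then sums the values of
    // adjacent records that share a partition and key (the assumed combiner).
    static List<Rec> sortAndCombine(List<Rec> buffer) {
        List<Rec> sorted = new ArrayList<>(buffer);
        sorted.sort(Comparator.comparingInt((Rec r) -> r.partition)
                              .thenComparing(r -> r.key));

        List<Rec> spill = new ArrayList<>();
        for (Rec r : sorted) {
            int last = spill.size() - 1;
            if (last >= 0
                    && spill.get(last).partition == r.partition
                    && spill.get(last).key.equals(r.key)) {
                // Same key as the previous record: fold it in instead of emitting it.
                Rec prev = spill.get(last);
                spill.set(last, new Rec(prev.partition, prev.key, prev.value + r.value));
            } else {
                spill.add(r);
            }
        }
        return spill;
    }

    public static void main(String[] args) {
        String[] words = {"a", "b", "a", "c", "a", "b"};
        List<Rec> buffer = new ArrayList<>();
        for (String w : words) {
            buffer.add(new Rec(partitionFor(w, 2), w, 1));
        }

        List<Rec> spill = sortAndCombine(buffer);
        System.out.println(buffer.size() + " records buffered, "
                + spill.size() + " records spilled");
        for (Rec r : spill) {
            System.out.println(r.partition + "\t" + r.key + "\t" + r.value);
        }
    }
}
```

With the six input records in `main`, only three records remain after combining, which is the disk and bandwidth saving the answer refers to. Because the buffer is already sorted at this point, the combine step is a single linear pass over adjacent records.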