Tags: hadoop, mapreduce, hadoop-streaming, hadoop-partitioning, combiners

Who will get a chance to execute first, the Combiner or the Partitioner?


I'm getting confused after reading the passage below from Hadoop: The Definitive Guide, 4th edition (page 204):

  • Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.

  • Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.

  • Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

Here are my doubts:

1) Which executes first, the combiner or the partitioner?

2) When there is both a custom combiner and a custom partitioner, what is the hierarchy of the execution steps?

3) Can we feed compressed data (Avro, Sequence files, etc.) to a custom combiner? If yes, then how?

Looking for a brief yet in-depth explanation!

Thanks in advance.


Solution

  • 1/ The answer is already given in the part you quoted: "Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."

    So the partitions are created first, in memory; within each partition the keys are sorted; if a combiner is configured, it is executed in memory on the sorted output; and the result is spilled to disk at the end.
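    For illustration, here is a minimal sketch of what such a combiner could look like in a WordCount-style job (the class name MyCombiner and the Text/IntWritable types are assumptions for this example, not something from the book). A combiner is just a Reducer that runs on the sorted map output of each in-memory partition before it is spilled:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MyCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the partial counts for this key within one partition's
            // sorted map output, shrinking what gets spilled to disk.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }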

    2/ A custom combiner and a custom partitioner take effect when they are specified in the driver class:

    job.setCombinerClass(MyCombiner.class);
    job.setPartitionerClass(MyPartitioner.class);
    

    If no custom combiner is specified, then no combiner is executed at all. If no custom partitioner is specified, then the default HashPartitioner is used (see page 221 of the same book). A sketch of a custom partitioner follows.
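    As an illustration, the MyPartitioner registered above could look like this sketch (the class name and the Text/IntWritable key/value types are assumptions); it simply reproduces the hash-and-modulo routing that the default HashPartitioner performs:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class MyPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Same idea as HashPartitioner: clear the sign bit, then take the
            // remainder so each key maps to one of the numPartitions reducers.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }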

    3/ Yes, it is possible. Don't forget that a combiner works by the same mechanism as a reducer, and a reducer can consume compressed data. If the job consumes compressed data, that means the input file format is compressed; for that, you can specify the input format in the driver class:

    // SequenceFile case:
    job.setInputFormatClass(SequenceFileInputFormat.class);

    // Avro case:
    job.setInputFormatClass(AvroKeyInputFormat.class);
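    Putting it all together, a minimal driver sketch for the SequenceFile case could look like the following (MyDriver, MyMapper, and MyReducer are hypothetical names assumed to exist; MyCombiner and MyPartitioner refer to the sketches above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combiner-partitioner-demo");
            job.setJarByClass(MyDriver.class);

            // The mapper (not the combiner) reads the compressed input files.
            job.setInputFormatClass(SequenceFileInputFormat.class);

            job.setMapperClass(MyMapper.class);
            job.setCombinerClass(MyCombiner.class);       // runs on sorted map output
            job.setPartitionerClass(MyPartitioner.class); // chooses the target reducer
            job.setReducerClass(MyReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }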