Tags: java, sorting, hadoop, mapreduce, hadoop2

How to sort a custom writable type in Hadoop


I have a custom type which contains fields of Hadoop native types (e.g. Text and IntWritable) and need to use it as a key and control how it is sorted during the shuffle/sort phase. There are similar questions like this one and this one, but they are about using native types. How can I achieve the same result with a custom type, and what requirements must it meet?


Solution

  • There are a few nuances to getting this right; some are obvious, others less so. I'll try to explain them in a few short points:

    1. First of all, the custom type must implement WritableComparable instead of just Writable and, of course, define the compareTo() method.
    2. Very important note from Hadoop: The Definitive Guide:

      All Writable implementations must have a default constructor so that the MapReduce framework can instantiate them, then populate their fields by calling readFields().

      And maybe the most error-prone part: the default constructor should instantiate those fields (if they are not initialized elsewhere), because they must not be null when readFields() is called.

    3. This point is about creating a custom comparator, if you are not satisfied with the default sorting. In this case you need to create a new class which extends WritableComparator and overrides its compare() method. After that you have two ways of using this comparator instead of the default one: either you set the class with the Job's setSortComparatorClass method:

      job.setSortComparatorClass(YourComparator.class);
      

      or register it in the static block of your custom type:

      static {  
          WritableComparator.define(CustomType.class, new YourComparator());
      }
      

      The static block registers the raw comparator so that whenever MapReduce sees the class, it knows to use the raw comparator as its default comparator.

    Here is an example of such a class with a static nested comparator.
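
    A minimal sketch of what such a class might look like, pulling the points above together. The class and field names (CustomType, name, count) and the sort order (name ascending, count descending) are illustrative assumptions, not part of the original answer:

    ```java
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Hypothetical composite key: sorts by name ascending, then count descending.
    public class CustomType implements WritableComparable<CustomType> {

        private Text name;
        private IntWritable count;

        // Default constructor required by the framework (point 2); the fields
        // must be instantiated here so readFields() can populate them.
        public CustomType() {
            this.name = new Text();
            this.count = new IntWritable();
        }

        public CustomType(String name, int count) {
            this.name = new Text(name);
            this.count = new IntWritable(count);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            name.write(out);
            count.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            name.readFields(in);
            count.readFields(in);
        }

        // Point 1: natural sort order for the key.
        @Override
        public int compareTo(CustomType other) {
            int cmp = name.compareTo(other.name);
            if (cmp != 0) {
                return cmp;
            }
            return -count.compareTo(other.count); // descending by count
        }

        // Point 3: static nested comparator that deserializes the keys and
        // delegates to compareTo().
        public static class Comparator extends WritableComparator {
            public Comparator() {
                super(CustomType.class, true); // true => instantiate keys for compare()
            }

            @Override
            @SuppressWarnings("rawtypes")
            public int compare(WritableComparable a, WritableComparable b) {
                return ((CustomType) a).compareTo((CustomType) b);
            }
        }

        // Register the comparator as the default for this type.
        static {
            WritableComparator.define(CustomType.class, new Comparator());
        }
    }
    ```

    Note that if the type is used as a map output key, you will usually also want consistent hashCode() and equals() implementations, since the default HashPartitioner uses hashCode() to assign keys to reducers.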