I have a mapper whose output value type is declared as an interface:
public interface OutValue extends Writable {}
During mapping, I create objects that implement this interface and emit them:
public class OutRecord implements OutValue {}
My Mapper is like this:
public class ExampleMapper extends
        Mapper<LongWritable, Text, ExampleKey, OutValue> {}
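The job driver registers the interface as the map output value class, along these lines (a reconstruction, since the driver code isn't shown above):

job.setMapOutputKeyClass(ExampleKey.class);
job.setMapOutputValueClass(OutValue.class);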
However I am getting this error:
java.io.IOException: Type mismatch in value from map: expected OutValue, recieved OutRecord
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:850)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:541)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
So my first instinct was to try to cast it like this:
context.write(key, (OutValue) record);
However, I still get the same error - the cast only changes the compile-time type, while the runtime class of the object is still OutRecord. This worked before I changed the mapper output type from the OutRecord class to the OutValue interface. My reason for the change is that I have many kinds of OutRecord classes I'd like to emit from this mapper.
Is this possible? Does OutValue have to be a class instead of an interface?
I dug through some of the source for Hadoop 0.20.205.0 and I found this:
public synchronized void collect(K key, V value, int partition) throws IOException {
...
if (value.getClass() != valClass) {
throw new IOException("Type mismatch in value from map: expected "
+ valClass.getName() + ", recieved "
+ value.getClass().getName());
}
So the runtime check requires strict class equality; there is no allowance for subclasses or interfaces. Surely this is a common use case - has anyone tried to do this?
There are a few reasons for this strict checking of types:
If you're outputting to a sequence file, the header of that file contains the names of the Key and Value classes. Hadoop then uses the registered serialization to create new instances of those classes when the sequence file is read back in.
If the class you register as the output type is an interface, or the actual objects you output are subclasses of the declared type, then either you will not be able to instantiate the interface at runtime, or the instantiated class will not be the subclass you are expecting (and your deserialization will most probably fail with an IOException).
(When I started typing this I had another reason in mind, but it has escaped me for the time being.)
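To make the first point concrete, here is roughly what the read path has to do; this is a simplified sketch, not the actual SequenceFile code, and headerValueClassName is an illustrative variable:

// Simplified: the reader takes the value class name from the file header
// and reflectively instantiates it before deserializing into it.
Class<?> valClass = conf.getClassByName(headerValueClassName);
Writable value = (Writable) ReflectionUtils.newInstance(valClass, conf);
value.readFields(in); // fails if valClass is an interface - there is no constructor to call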
Now, if you want to output different types (subclasses), look into using GenericWritable to 'wrap' your objects - in this scheme each object written out is preceded by a type tag. Look at the source and javadocs for more details.
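A minimal sketch of such a wrapper (OtherOutRecord stands in for your additional record types):

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Writable;

public class OutValueWrapper extends GenericWritable {
    @SuppressWarnings("unchecked")
    private static final Class<? extends Writable>[] CLASSES =
        (Class<? extends Writable>[]) new Class<?>[] {
            OutRecord.class,      // written as type tag 0
            OtherOutRecord.class  // written as type tag 1
        };

    // GenericWritable writes the index of the concrete class before the
    // object itself, so keep this array's order stable once data exists.
    @Override
    protected Class<? extends Writable>[] getClasses() {
        return CLASSES;
    }
}

You would then declare the mapper as Mapper<LongWritable, Text, ExampleKey, OutValueWrapper>, register OutValueWrapper via job.setMapOutputValueClass, and emit with:

wrapper.set(record);        // wrap the concrete OutRecord
context.write(key, wrapper);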
Be warned that this comes at some cost: input and output will not benefit from the object re-use seen elsewhere in Hadoop, though you may not notice it. You could rewrite GenericWritable to be more efficient by pooling one object per instance type seen and re-using it in the usual way - a rough sketch follows.
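That rewrite might look something like this (an untested sketch; PooledOutValue and its registered classes are illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PooledOutValue implements Writable {
    // Same registration idea as GenericWritable: array order defines the type tag.
    @SuppressWarnings("unchecked")
    private static final Class<? extends Writable>[] CLASSES =
        (Class<? extends Writable>[]) new Class<?>[] { OutRecord.class, OtherOutRecord.class };

    private final Writable[] pool = new Writable[CLASSES.length]; // one cached instance per type
    private byte type = -1;

    public void set(Writable obj) {
        for (byte i = 0; i < CLASSES.length; i++) {
            if (CLASSES[i] == obj.getClass()) {
                type = i;
                pool[i] = obj;
                return;
            }
        }
        throw new IllegalArgumentException("Unregistered type: " + obj.getClass());
    }

    public Writable get() {
        return pool[type];
    }

    public void write(DataOutput out) throws IOException {
        out.writeByte(type);    // type tag precedes the payload, as in GenericWritable
        pool[type].write(out);
    }

    public void readFields(DataInput in) throws IOException {
        type = in.readByte();
        if (pool[type] == null) {            // instantiate once per type...
            try {
                pool[type] = CLASSES[type].newInstance();
            } catch (Exception e) {
                throw new IOException(e);
            }
        }
        pool[type].readFields(in);           // ...then re-use it on every read
    }
}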