Why is the setMapOutputKeyClass method necessary in a MapReduce job?


When I write a MapReduce program, I often write code like

 job1.setMapOutputKeyClass(Text.class); 

But why should we specify the map output key class explicitly? We have already specified it in the mapper class declaration, such as

public static class MyMapper extends
        Mapper<LongWritable, Text, Text, Text>

In the book Hadoop: The Definitive Guide, a table (Properties for configuring types) shows that the method setMapOutputKeyClass is optional, but in my tests I found that it is necessary; otherwise the Eclipse console shows

Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, received org.apache.hadoop.io.Text

Can someone tell me the reason for this?

The book says:

"The settings that have to be compatible with the MapReduce types are listed in the lower part of Table 8-1". Does this mean that we have to set the properties in the lower part of the table, but not the ones in the upper part?

The content of the table looks like this:

Properties for configuring types:
mapreduce.job.inputformat.class  
mapreduce.map.output.key.class  
mapreduce.map.output.value.class  
mapreduce.job.output.key.class  
mapreduce.job.output.value.class 

Properties that must be consistent with the types:
mapreduce.job.map.class   
mapreduce.job.combine.class  
mapreduce.job.partitioner.class  
mapreduce.job.output.key.comparator.class 
mapreduce.job.output.group.comparator.class  
mapreduce.job.reduce.class  
mapreduce.job.outputformat.class

Solution

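  • setMapOutputKeyClass() and setMapOutputValueClass() are optional only as long as the mapper's output types match the job's output types specified by setOutputKeyClass() and setOutputValueClass() respectively; when the map output classes are not set, Hadoop falls back to the job output classes. In other words, if your mapper output does not match your reducer output, you have to use one or both of these methods. If the job output key class is not set either, it defaults to LongWritable, which is why the error reports that LongWritable was expected.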

    As for your question regarding the generic arguments in the mapper declaration: due to Java type erasure (see "Java generics type erasure: when and what happens?"), Hadoop does not know them at runtime, even though they are known to the compiler. It relies on the classes set in the job configuration instead.
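Both points can be illustrated with a small plain-Java sketch that does not depend on Hadoop. It is a simplified model, not Hadoop's actual code: the class MapOutputTypeSketch and its helper methods are hypothetical, the configuration is modeled as a plain Map, and the LongWritable default is assumed from the error message in the question. The property names are the ones from the table in the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapOutputTypeSketch {

    // Property names as listed in the question's table.
    public static final String MAP_OUTPUT_KEY = "mapreduce.map.output.key.class";
    public static final String JOB_OUTPUT_KEY = "mapreduce.job.output.key.class";

    // Assumed default, inferred from the "expected ... LongWritable" error message.
    public static final String DEFAULT_KEY = "org.apache.hadoop.io.LongWritable";

    // Hypothetical helper: the job output key class, or the default if it is unset.
    public static String jobOutputKeyClass(Map<String, String> conf) {
        return conf.getOrDefault(JOB_OUTPUT_KEY, DEFAULT_KEY);
    }

    // Hypothetical helper: the map output key class, falling back to the
    // job output key class when it has not been set explicitly.
    public static String mapOutputKeyClass(Map<String, String> conf) {
        String explicit = conf.get(MAP_OUTPUT_KEY);
        return explicit != null ? explicit : jobOutputKeyClass(conf);
    }

    // Simplified stand-in for the runtime check behind
    // "Type mismatch in key from map": the emitted key class must equal
    // the configured (or fallback) map output key class.
    public static boolean keyAccepted(Map<String, String> conf, String emittedKeyClass) {
        return mapOutputKeyClass(conf).equals(emittedKeyClass);
    }

    // Type erasure: both lists share one runtime class, so the type
    // argument cannot be read from the object itself.
    public static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Long> longs = new ArrayList<>();
        return strings.getClass() == longs.getClass();
    }

    public static void main(String[] args) {
        String text = "org.apache.hadoop.io.Text";
        System.out.println("erased to same class: " + sameRuntimeClass());

        // Nothing configured: the fallback chain ends at LongWritable,
        // so a Text key from the mapper is rejected.
        Map<String, String> conf = new HashMap<>();
        System.out.println("nothing set: " + keyAccepted(conf, text));

        // Job output key is Text and the mapper emits Text: no extra setting needed.
        conf.put(JOB_OUTPUT_KEY, text);
        System.out.println("job output key = Text: " + keyAccepted(conf, text));

        // Reducer emits a different key type: the map output key must be set explicitly.
        conf.put(JOB_OUTPUT_KEY, "org.apache.hadoop.io.IntWritable");
        System.out.println("job output key = IntWritable: " + keyAccepted(conf, text));
        conf.put(MAP_OUTPUT_KEY, text);
        System.out.println("map output key = Text: " + keyAccepted(conf, text));
    }
}
```

In a real driver, the last case corresponds to calling job1.setMapOutputKeyClass(Text.class) in addition to setOutputKeyClass(), as in the question.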