I am trying to attach a custom (Java) partitioner to my MapReduce streaming job. I am using this command:
../bin/hadoop jar ../contrib/streaming/hadoop-streaming-1.2.1.jar \
-libjars ./NumericPartitioner.jar -D mapred.map.tasks=12 -D mapred.reduce.tasks=36 \
-input /input -output /output/keys -mapper "map_threeJoin.py" -reducer "keycount.py" \
-partitioner newjoin.NumericPartitioner -file "map_threeJoin.py" \
-cmdenv b_size=6 -cmdenv c_size=6
The important bit is the file NumericPartitioner.jar, which resides in the same folder the command is run from (one level down from the Hadoop installation root). Here is its code:
package newjoin;
import java.util.*;
import java.lang.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;
public class NumericPartitioner extends Partitioner<Text,Text>
{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        return Integer.parseInt(key.toString().split("\\s")[0]) % numReduceTasks;
    }
}
And yet, when I try to run the above command, I get:
-partitioner : class not found : newjoin.NumericPartitioner
Streaming Command Failed!
What's going on here, and how can I get MapReduce to find my partitioner?
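The -libjars option makes your third-party JARs available to the remote map and reduce task JVMs. It does not put them on the classpath of the client JVM (the JVM created when you run the hadoop jar command), and the -partitioner class is resolved on the client when the job is submitted, which is why you get "class not found". To make the same JARs available to the client JVM, you also need to add them to the HADOOP_CLASSPATH environment variable: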
$ export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:./NumericPartitioner.jar
$ ../bin/hadoop jar ../contrib/streaming/hadoop-streaming-1.2.1.jar \
-libjars ./NumericPartitioner.jar \
-D mapred.map.tasks=12 -D mapred.reduce.tasks=36 \
-input /input -output /output/keys -mapper "map_threeJoin.py" -reducer "keycount.py" \
-partitioner newjoin.NumericPartitioner -file "map_threeJoin.py" \
-cmdenv b_size=6 -cmdenv c_size=6
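Once the class is found, it may be worth checking the partition arithmetic on its own, outside Hadoop. The following is only a plain-Java sketch of the same calculation, not the partitioner itself: the class name NumericPartitionSketch and the helper partitionFor are hypothetical, and it assumes every key starts with an integer followed by whitespace, as the question's getPartition does. It also swaps % for Math.floorMod, because % yields a negative result for a negative key and a negative partition index is not valid.

```java
// Hypothetical stand-alone check of the partition arithmetic; no Hadoop classes needed.
class NumericPartitionSketch {

    // Mirrors the question's getPartition: take the first whitespace-delimited
    // token of the key, parse it as an int, and map it onto [0, numReduceTasks).
    static int partitionFor(String key, int numReduceTasks) {
        String firstToken = key.trim().split("\\s+")[0];
        int n = Integer.parseInt(firstToken);
        // floorMod (an assumption, not in the original code) keeps the result
        // non-negative even when n is negative.
        return Math.floorMod(n, numReduceTasks);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("7 3 1", 36));  // first token is 7
        System.out.println(partitionFor("40\tx", 36));  // 40 wraps around 36
        System.out.println(partitionFor("-5 2", 36));   // negative key stays in range
    }
}
```

If your keys can never be negative, the original % expression gives the same results and you can leave it as it is.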