I am new to Hadoop and MapReduce partitioners. I want to write my own partitioner, and I need to read a file inside it. I have searched many times and found that I should use the distributed cache. My question is: how can I use the distributed cache in my Hadoop partitioner? What should I write in my partitioner?
public static class CaderPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return 0;
    }
}
Thanks
The easiest way to work this out is to look at the example Partitioners included with Hadoop. In this case the one to look at is the TotalOrderPartitioner, which reads in a pre-generated file to help direct keys.
You can find the source code here, and here's a gist showing how to use it.
Firstly you need to tell the partitioner where the file can be found (on HDFS) in your MapReduce job's driver:
// Define partition file path.
Path partitionPath = new Path(outputDir + "-part.lst");
// Use Total Order Partitioner.
job.setPartitionerClass(TotalOrderPartitioner.class);
// Generate partition file from map-only job's output.
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionPath);
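If you don't already have a partition file, one common way to produce it (an assumption here, not something the original answer prescribes) is Hadoop's InputSampler, which samples the job's input and writes split points in the format TotalOrderPartitioner expects:

    // Sketch: sample roughly 1% of records, up to 10,000 samples,
    // reading from at most 10 input splits.
    InputSampler.Sampler<Text, IntWritable> sampler =
        new InputSampler.RandomSampler<>(0.01, 10000, 10);
    // Writes the sampled split points to the path set via setPartitionFile().
    InputSampler.writePartitionFile(job, sampler);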
In the TotalOrderPartitioner source you'll see that it implements Configurable, which gives it access to the job configuration so it can get the path to the file on HDFS.
The file is read in the public void setConf(Configuration conf) method, which is called when the Partitioner object is created. At this point you can read the file and do whatever set-up you want.
I would think you can re-use a lot of the code from this partitioner.
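Putting that together, a minimal sketch of what your CaderPartitioner could look like follows. Note the config key name (cader.partition.file) and the tab-separated "key<TAB>partition" file format are assumptions for illustration, not anything Hadoop mandates:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public static class CaderPartitioner extends Partitioner<Text, IntWritable>
            implements Configurable {

        // Hypothetical config key naming the file on HDFS; set it in the driver:
        // job.getConfiguration().set(PARTITION_FILE_KEY, "/path/to/file");
        public static final String PARTITION_FILE_KEY = "cader.partition.file";

        private Configuration conf;
        private final Map<String, Integer> keyToPartition = new HashMap<>();

        @Override
        public void setConf(Configuration conf) {
            this.conf = conf;
            // Called when the framework instantiates the partitioner;
            // read the file once here, as TotalOrderPartitioner does.
            Path file = new Path(conf.get(PARTITION_FILE_KEY));
            try (FSDataInputStream in = FileSystem.get(conf).open(file);
                 BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in))) {
                String line;
                // Assumed file format: one "key<TAB>partition" pair per line.
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t");
                    keyToPartition.put(parts[0], Integer.parseInt(parts[1]));
                }
            } catch (IOException e) {
                throw new IllegalArgumentException("Can't read partition file", e);
            }
        }

        @Override
        public Configuration getConf() {
            return conf;
        }

        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            Integer p = keyToPartition.get(key.toString());
            // Fall back to hash partitioning for keys not listed in the file.
            return p != null ? (p % numReduceTasks)
                             : (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }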