java · file-io · hadoop · mapreduce · distributed-cache

Hadoop cache file for all map tasks


My map function has to read a file for every input. That file never changes; it is read-only. I think the distributed cache could help me a lot, but I can't find a way to use it. The public void configure(JobConf conf) function that I would need to override seems to be deprecated, and JobConf certainly is. All the DistributedCache tutorials use the deprecated way too. What can I do? Is there another configure function that I can override?

These are the very first lines of my map function:

     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(conf);
     Path inFile = new Path("planet/MFile");          // load the MFile
     FSDataInputStream in = fs.open(inFile);
     DecisionTree dtree = new DecisionTree().loadTree(in);

I want to cache that MFile so that my map function doesn't need to read it over and over again.


Solution

  • JobConf was deprecated in 0.20.x, but in 1.0.0 it is not! :-) (as of this writing)

    To answer your question: there are two ways to write MapReduce jobs in Java. One is by extending the classes in the org.apache.hadoop.mapreduce package (the new API), and the other is by implementing the interfaces in the org.apache.hadoop.mapred package (the old API).

    I'm not sure which one you are using, but if you don't have a configure method to override, you will have a setup method to override instead:

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    

    This is the new-API equivalent of configure and should do what you need.

    You get a setup method to override when you extend the Mapper class from the org.apache.hadoop.mapreduce package.
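    Putting it together, here is a minimal sketch of how setup can load the cached file once per task instead of once per record. The class name TreeMapper and the input/output types are assumptions for illustration; DecisionTree and planet/MFile come from the question. This targets the Hadoop 1.x API, where DistributedCache is still the documented way to ship side files.

    ```java
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TreeMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Loaded once per task in setup(), then reused by every map() call.
        private DecisionTree dtree;

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            // Files registered with the distributed cache are copied to each
            // task node; getLocalCacheFiles returns their local paths.
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            FileSystem localFs = FileSystem.getLocal(conf);
            FSDataInputStream in = localFs.open(cached[0]);
            dtree = new DecisionTree().loadTree(in);  // DecisionTree is the question's class
            in.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // dtree is already loaded here; no per-record file I/O is needed.
        }
    }
    ```

    In the driver, before submitting the job, register the file with something like DistributedCache.addCacheFile(new URI("planet/MFile"), job.getConfiguration()); the framework then distributes it to every task node.
    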