
Hadoop 1.2.1 - using Distributed Cache


I have developed a Hadoop application that uses the distributed cache. I built it against Hadoop 2.9.0, and everything works fine in both stand-alone and pseudo-distributed mode.

Driver:

public class MyApp extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: MyApp -files cache.txt <inputpath> <outputpath>");
            System.exit(-1);
        }

        // ToolRunner parses generic options such as -files before calling run()
        int res = ToolRunner.run(new Configuration(), new MyApp(), args);
        System.exit(res);
    }

...

Mapper:

public class IDSMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void setup(Context context) throws IOException {
        BufferedReader bfr = new BufferedReader(new FileReader(new File("cache.txt")));
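The read in setup can be factored into a small helper, which makes a missing file easier to diagnose. This is only a sketch: `CacheFileLoader` and `load` are hypothetical names that are not part of the original application, and it assumes Java 7 or later, a UTF-8 encoded file, and that `-files cache.txt` has made the file available under that name in the task's working directory.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original application.
final class CacheFileLoader {

    private CacheFileLoader() {}

    // Returns the trimmed, non-empty lines of the given file.
    // Assumption: the file is UTF-8 encoded and its name is resolved
    // against the current working directory when it is relative.
    static List<String> load(String fileName) throws IOException {
        File file = new File(fileName);
        if (!file.isFile()) {
            // Include the absolute path so the log shows where the task looked.
            throw new FileNotFoundException(
                    "Cache file not found: " + file.getAbsolutePath());
        }
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    lines.add(trimmed);
                }
            }
        }
        return lines;
    }
}
```

Inside the mapper, setup would then call something like `CacheFileLoader.load("cache.txt")`. If the file was never shipped to the task, the exception message shows the absolute path that was searched.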

Launch command: sudo bin/hadoop jar MyApp.jar -files cache.txt /input /output

Now I need to measure execution time on a real Hadoop cluster. Unfortunately, the only cluster at my disposal runs Hadoop 1.2.1. So I created a new Eclipse project, referenced the appropriate Hadoop 1.2.1 jar files, and everything works fine in stand-alone mode. However, pseudo-distributed mode with Hadoop 1.2.1 fails with a FileNotFoundException in the Mapper class (setup method) when it tries to read the distributed cache file.

Do I have to handle distributed cache files in some other way in Hadoop 1.2.1?


Solution

  • The problem was in the run method. I called Job.getInstance without arguments, whereas I should have passed it the configuration:

    Job job = Job.getInstance(getConf());
    

    I still don't know why Hadoop 2.9.0 works with just:

    Job job = Job.getInstance();
    

    but passing getConf() solved my problem.
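Some background on why getConf() matters: ToolRunner parses generic options such as -files into the Configuration it hands to the Tool, and getConf() returns that same object. The no-argument Job.getInstance() starts from a new Configuration instead, so nothing parsed from the command line reaches the job. The toy model below reproduces that mechanism in plain Java. Every class, method, and property name in it (`ToyConf`, `parseGenericOptions`, `cache.files`) is a made-up stand-in, not Hadoop API, and it does not explain the Hadoop 2.9.0 behaviour.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-ins for illustration only; none of these are Hadoop classes.
final class ConfPropagationDemo {

    // Plays the role of a configuration object: a bag of string properties.
    static final class ToyConf {
        private final Map<String, String> props = new HashMap<>();

        void set(String key, String value) {
            props.put(key, value);
        }

        String get(String key) {
            return props.get(key);
        }
    }

    // Plays the role of the option parser: moves "-files <list>" out of the
    // argument list into the configuration and returns the remaining args.
    static String[] parseGenericOptions(ToyConf conf, String[] args) {
        List<String> rest = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if ("-files".equals(args[i]) && i + 1 < args.length) {
                conf.set("cache.files", args[++i]); // hypothetical property name
            } else {
                rest.add(args[i]);
            }
        }
        return rest.toArray(new String[0]);
    }

    // The cache files a job would ship if it were built from this configuration.
    static String cacheFilesSeenByJob(ToyConf jobConf) {
        return jobConf.get("cache.files");
    }

    public static void main(String[] args) {
        ToyConf parsed = new ToyConf();
        String[] rest = parseGenericOptions(
                parsed, new String[] {"-files", "cache.txt", "/input", "/output"});

        System.out.println("remaining args: " + String.join(" ", rest));
        // Analogous to Job.getInstance(getConf()): the parsed options are visible.
        System.out.println("job from parsed conf: " + cacheFilesSeenByJob(parsed));
        // Analogous to Job.getInstance(): a fresh configuration knows nothing.
        System.out.println("job from fresh conf: " + cacheFilesSeenByJob(new ToyConf()));
    }
}
```

In the model, the job built from the parsed configuration sees cache.txt, while the one built from a fresh configuration sees nothing. The model also shows that the arguments reaching run no longer contain -files cache.txt, so the input and output paths are the first two remaining arguments.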