Tags: hadoop, mapreduce, distributed-cache

How to use a MapReduce output in Distributed Cache


Let's say I have a MapReduce job that creates an output file part-00000, and another job runs after this job completes.

How can I use the output file of the first job in the distributed cache for the second job?


Solution

  • The following steps might help:

    • Pass the first job's output directory path to the second job's driver class.

    • Use a PathFilter to list the files whose names start with part-*. Refer to the code snippet below for your second job's driver class:

          FileSystem fs = FileSystem.get(conf);
          // List only the part-* files in the first job's output directory
          FileStatus[] fileList = fs.listStatus(new Path("1st job o/p path"),
                  new PathFilter() {
                      @Override
                      public boolean accept(Path path) {
                          return path.getName().startsWith("part-");
                      }
                  });
      
    • Iterate over every part-* file and add it to the distributed cache.

          for (FileStatus file : fileList) {
              // Path.toUri() already returns a URI, so no new URI(...) wrapper is needed
              DistributedCache.addCacheFile(file.getPath().toUri(), conf);
          }
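
Once the files are registered, the second job's mapper can read them in its setup() method. Below is a minimal sketch of such a mapper; the class name SecondJobMapper, the tab-separated record format, and the join logic in map() are illustrative assumptions, not part of the original answer.

      import java.io.BufferedReader;
      import java.io.FileReader;
      import java.io.IOException;
      import java.util.HashMap;
      import java.util.Map;

      import org.apache.hadoop.filecache.DistributedCache;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class SecondJobMapper extends Mapper<LongWritable, Text, Text, Text> {

          private final Map<String, String> lookup = new HashMap<String, String>();

          @Override
          protected void setup(Context context) throws IOException {
              // Local copies of every file added via DistributedCache.addCacheFile()
              Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
              if (cacheFiles != null) {
                  for (Path cacheFile : cacheFiles) {
                      loadFile(cacheFile);
                  }
              }
          }

          private void loadFile(Path file) throws IOException {
              BufferedReader reader = new BufferedReader(new FileReader(file.toString()));
              try {
                  String line;
                  while ((line = reader.readLine()) != null) {
                      // Assumes tab-separated key/value lines from the first job's output
                      String[] parts = line.split("\t", 2);
                      if (parts.length == 2) {
                          lookup.put(parts[0], parts[1]);
                      }
                  }
              } finally {
                  reader.close();
              }
          }

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              // Join each input record against the cached lookup table (illustrative)
              String joined = lookup.get(value.toString());
              if (joined != null) {
                  context.write(value, new Text(joined));
              }
          }
      }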
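
If you are on Hadoop 2.x, note that the DistributedCache class is deprecated there; the same registration can be done directly on the Job object. A short sketch, assuming a Job instance named job:

      // Hadoop 2.x style: register each part-* file on the Job itself
      for (FileStatus file : fileList) {
          job.addCacheFile(file.getPath().toUri());
      }

In the mapper, context.getCacheFiles() then returns the corresponding URIs.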