Tags: hadoop, mapreduce, distributed-cache

How to use a MapReduce output in Distributed Cache


Let's say I have a MapReduce job that creates an output file part-00000, and another job runs after this job completes.

How can I use the output file of the first job in the distributed cache for the second job?


Solution

  • The following steps might help:

    • Pass the first job's output directory path to the second job's driver class.

    • Use a PathFilter to list the files whose names start with part-*. Refer to the code snippet below for your second job's driver class:

          FileSystem fs = FileSystem.get(conf);
          // List only the part-* files in the first job's output directory
          FileStatus[] fileList = fs.listStatus(new Path("1st job o/p path"),
                  new PathFilter() {
                      @Override
                      public boolean accept(Path path) {
                          return path.getName().startsWith("part-");
                      }
                  });
      
    • Iterate over every part-* file and add it to the distributed cache.

          for (FileStatus file : fileList) {
              // Path.toUri() already returns a URI, so no new URI(...) wrapper is needed
              DistributedCache.addCacheFile(file.getPath().toUri(), conf);
          }
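
Once the files are registered, the second job's mapper can read them in its setup() method. Below is a minimal sketch of such a mapper; the class name SecondJobMapper, the tab-separated record format, and the join logic in map() are illustrative assumptions, not part of the original answer.

      import java.io.BufferedReader;
      import java.io.FileReader;
      import java.io.IOException;
      import java.util.HashMap;
      import java.util.Map;

      import org.apache.hadoop.filecache.DistributedCache;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class SecondJobMapper extends Mapper<LongWritable, Text, Text, Text> {

          private final Map<String, String> lookup = new HashMap<String, String>();

          @Override
          protected void setup(Context context) throws IOException {
              // Local copies of every file added via DistributedCache.addCacheFile()
              Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
              if (cacheFiles != null) {
                  for (Path cacheFile : cacheFiles) {
                      loadFile(cacheFile);
                  }
              }
          }

          private void loadFile(Path file) throws IOException {
              BufferedReader reader = new BufferedReader(new FileReader(file.toString()));
              try {
                  String line;
                  while ((line = reader.readLine()) != null) {
                      // Assumes tab-separated key/value lines from the first job's output
                      String[] parts = line.split("\t", 2);
                      if (parts.length == 2) {
                          lookup.put(parts[0], parts[1]);
                      }
                  }
              } finally {
                  reader.close();
              }
          }

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              // Join each input record against the cached lookup table (illustrative)
              String joined = lookup.get(value.toString());
              if (joined != null) {
                  context.write(value, new Text(joined));
              }
          }
      }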
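
If you are on Hadoop 2.x, note that the DistributedCache class is deprecated there; the same registration can be done directly on the Job object. A short sketch, assuming a Job instance named job:

      // Hadoop 2.x style: register each part-* file on the Job itself
      for (FileStatus file : fileList) {
          job.addCacheFile(file.getPath().toUri());
      }

In the mapper, context.getCacheFiles() then returns the corresponding URIs.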