I'm implementing an iterative algorithm in which each iteration produces a result that is used in the map phase of the next iteration.
Should I make that result available to the mapper via the distributed cache, or should I just read it from HDFS? Which is more efficient?
The file should not be very big. The idea is to read it in the setup phase and keep it in the mapper's memory.
Thanks
If the file isn't that big and will be read in the mapper's setup, the DistributedCache is the way forward. Of course, if you're not reading anything else into that second job, it raises the question of why you're using a MapReduce job at all.
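A minimal sketch of that pattern, assuming the Hadoop 2.x `mapreduce` API; the HDFS path, link name and class names are hypothetical. The driver ships the previous iteration's output to every task via `Job.addCacheFile`, and the mapper reads it once in `setup()` through the symlink created from the `#` fragment:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterationJob {

    public static class IterationMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        private final List<String> previousResult = new ArrayList<>();

        @Override
        protected void setup(Context context) throws IOException {
            // The cached file is symlinked into the task's working directory
            // under the name given after '#', so it opens like a local file.
            try (BufferedReader reader =
                     new BufferedReader(new FileReader("prev_result"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    previousResult.add(line); // keep the small file in memory
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... use previousResult while processing the main input split ...
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "iteration");
        job.setJarByClass(IterationJob.class);
        job.setMapperClass(IterationMapper.class);

        // Ship the previous iteration's (small) output to every mapper.
        job.addCacheFile(new URI("/user/me/iter-1/part-r-00000#prev_result"));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```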
Reading from HDFS (i.e. streaming a file into a mapper through an InputFormat) and using the DistributedCache serve two quite different use cases. The DistributedCache is designed for small files that fit in memory, whereas reading into a mapper via an InputFormat is designed for large, distributed datasets that can only be handled by a distributed job.
If your dataset is small enough to go in the DistributedCache, you can probably just process it with a plain Java program and avoid the overhead of MapReduce entirely, as in the sketch below.
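For illustration, a minimal sketch of that plain-Java alternative, assuming the file lives on HDFS and its path is passed as the first argument (the class name is hypothetical); it streams the file directly with the HDFS `FileSystem` API, with no job scheduling, task startup, or shuffle:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileProcessor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Stream the small file straight from HDFS in a single JVM.
        Path input = new Path(args[0]); // e.g. the previous iteration's output
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(input), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // ... process each line in plain Java ...
            }
        }
    }
}
```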