Search code examples
javahadoopmapreducehdfsdistributed-computing

Is it possible to restrict a MapReduce job from accessing remote data?


We have particular algorithm that we want to integrate with HDFS. The algorithm requires us to access data locally (the work would be done exclusively in the Mapper). However, we do want to take advantage of HDFS in terms of distributing the file (providing reliability and striping). After the calculation is performed, we'd use the Reducer to simply send back the answer, rather than perform any additional work. Avoiding network use is an explicit goal. Is there a configuration setting that would allow us to restrict network data access, so that when a MapReduce job is started it will only access it's local DataNode?

UPDATE: Adding a bit of context

We're attempting to analyze this problem with string matching. Assume our cluster has N nodes and a file is stored with N GB of text. The file is stored into HDFS and distributed in even parts to the nodes (1 part per node). Can we create a MapReduce job that launches one process on each node to access the part of the file that's sitting on the same host? Or, would the MapReduce framework unevenly distribute the work? (e.g. 1 job accessing all N part of the data, or .5N nodes attempting to process the whole file?


Solution

  • If you set the number of reduce tasks to zero you can skip the shuffling and therefore the network cost of your algorithm.

    While creating your job this can be done with the following line of code

    job.setNumReduceTasks(0);
    

    I don't know what you algorithm will do but say it is a pattern matching algorithm looking for the occurrence of a particular word, then the mappers would report the number of matches per split. If you want to add the counts you need network communication and a reducer.

    First google match on a map-only example I found: Map-Only MR jobs