python, hadoop, mapreduce, cluster-computing, distributed-computing

Call mapper when reducer is done


I am executing the job as:

hadoop/bin/./hadoop jar /home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -D mapred.reduce.tasks=2 \
    -file kmeans_mapper.py  -mapper kmeans_mapper.py \
    -file kmeans_reducer.py -reducer kmeans_reducer.py \
    -input gutenberg/small_train.csv -output gutenberg/out

When the two reducers are done, I would like to do something with their results, so ideally I would like to call another file (another mapper?) that would receive the reducers' output as its input. How can I do that easily?

I checked this blog, which has an mrjob example, but it doesn't explain the chaining and I don't see how to do it for my case.

The MapReduce tutorial states:

Users may need to chain MapReduce jobs to accomplish complex tasks which cannot be done via a single MapReduce job. This is fairly easy since the output of the job typically goes to distributed file-system, and the output, in turn, can be used as the input for the next job.

but it doesn't give any example...

Here is some Java code I could understand, but I am writing Python! :/


This question sheds some light: Chaining multiple mapreduce tasks in Hadoop streaming


Solution

  • It is possible to do what you're asking for using the Java API, as in the example you found.

    But you are using the streaming API, which simply reads from standard input and writes to standard output. There is no callback to signal that a MapReduce job has completed other than the hadoop jar command itself finishing, and even that completion doesn't really indicate success. That being said, it really isn't possible without some more tooling around the streaming API; a minimal driver sketch follows at the end of this answer.

    If the output were written to the local terminal rather than to HDFS, it might be possible to pipe that output into the input of another streaming job, but unfortunately the inputs and outputs of the streaming jar have to be paths on HDFS.
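    A minimal sketch of that extra tooling, assuming the jar path, script names, and HDFS paths from the question; kmeans_postprocess.py and gutenberg/out2 are hypothetical names for the second stage. It treats the exit status of each hadoop jar invocation as the only completion signal, and only then submits a map-only job whose -input is the first job's -output directory on HDFS:

#!/usr/bin/env python
# Driver sketch: chains two Hadoop Streaming jobs from outside the streaming API.
import subprocess
import sys

STREAMING_JAR = "/home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar"

def run_streaming_job(args):
    """Submit one streaming job; True if the hadoop jar command exited with 0."""
    return subprocess.call(["hadoop/bin/hadoop", "jar", STREAMING_JAR] + args) == 0

# Stage 1: the k-means job from the question.
stage1 = ["-D", "mapred.reduce.tasks=2",
          "-file", "kmeans_mapper.py", "-mapper", "kmeans_mapper.py",
          "-file", "kmeans_reducer.py", "-reducer", "kmeans_reducer.py",
          "-input", "gutenberg/small_train.csv", "-output", "gutenberg/out"]

# Stage 2 (hypothetical): a map-only job that reads the reducers' output from HDFS.
stage2 = ["-D", "mapred.reduce.tasks=0",
          "-file", "kmeans_postprocess.py", "-mapper", "kmeans_postprocess.py",
          "-input", "gutenberg/out", "-output", "gutenberg/out2"]

if not run_streaming_job(stage1):
    sys.exit("stage 1 failed")
if not run_streaming_job(stage2):
    sys.exit("stage 2 failed")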
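    And purely for checking the chain's logic without HDFS, the usual trick is to wire the scripts together with plain pipes on the local machine; a rough sketch, again using the hypothetical kmeans_postprocess.py:

import subprocess

# Local approximation of mapper -> sort -> reducer -> post-processing (no Hadoop).
with open("small_train.csv") as src, open("local_out.txt", "w") as dst:
    mapper = subprocess.Popen(["python", "kmeans_mapper.py"], stdin=src, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(["sort"], stdin=mapper.stdout, stdout=subprocess.PIPE)
    reducer = subprocess.Popen(["python", "kmeans_reducer.py"], stdin=sorter.stdout, stdout=subprocess.PIPE)
    post = subprocess.Popen(["python", "kmeans_postprocess.py"], stdin=reducer.stdout, stdout=dst)
    for p in (mapper, sorter, reducer):
        p.stdout.close()  # let SIGPIPE propagate if a downstream stage exits early
    post.wait()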