What's the difference between submitting a hadoop-streaming job using the yarn jar command and using the hadoop jar command?
This is from the current documentation:
hadoop jar hadoop-streaming-2.7.1.jar \
-D mapreduce.job.reduces=2 \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc
But the same job could be submitted just as well with:
yarn jar hadoop-streaming-2.7.1.jar \
-D mapreduce.job.reduces=2 \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc
If the two commands are equivalent (as I think they are), which is preferred, and why?
They are equivalent if your MapReduce framework is YARN (that is, mapreduce.framework.name is set to yarn). If not, hadoop jar will run your jar file with MRv1, while yarn jar will run it on YARN (MRv2).
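If you are not sure which framework your cluster is configured for, you can look up mapreduce.framework.name in mapred-site.xml. Below is a minimal sketch, not an official Hadoop tool: framework_name is a hypothetical helper, the config path in the usage comment is an assumption about your installation, and it assumes the <name> and <value> elements sit on adjacent lines, as in the stock config templates.

```shell
# Hypothetical helper: print the value of mapreduce.framework.name
# from the mapred-site.xml file given as the first argument.
# Assumes <value> is on the line right after <name> (or on the same line).
framework_name() {
  conf_file="$1"
  grep -A1 '<name>mapreduce.framework.name</name>' "$conf_file" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Usage (the path is an assumption; adjust to your installation):
# framework_name "${HADOOP_CONF_DIR:-/etc/hadoop/conf}/mapred-site.xml"
```

If this prints yarn, both commands should submit the streaming job to YARN. If it prints nothing, the property is not set in that file and the default from mapred-default.xml applies.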