Search code examples
hadoopmapreducehadoop-yarnhadoop-streaming

Where is the temp output data of map or reduce tasks


With MapReduce v2, the output data that comes out from a map or a reduce task is saved in the local disk or the HDFS when all the tasks finish.

Since tasks end at different times, I was expecting that the data were written as a task finish. For example, task 0 finish and so the output is written, but task 1 and task 2 are still running. Now task 2 finish the output is written, and task 1 is still running. Finally, task 1 finish and the last output is written. But this does not happen. The outputs only appear in the local disk or HDFS when all the tasks finish.

I want to access the task output as the data is being produced. Where is the output data before all the tasks finish?

Update

After I have set these params in mapred-site.xml

<property><name>mapreduce.task.files.preserve.failedtasks</name><value>true</value></property>
<property><name>mapreduce.task.files.preserve.filepattern</name><value>*</value></property>

and these params in hdfs-site.xml

<property> <name>dfs.name.dir</name> <value>/tmp/data/dfs/name/</value> </property>
<property> <name>dfs.data.dir</name> <value>/tmp/data/dfs/data/</value> </property>

And this value in core-site.xml

<property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-temp</value> </property>

but I still can't found where the intermediate output or the final output is saved as they are produced by the tasks.

I have listed all directories in hdfs dfs -ls -R / and in the tmp dir I have only found the job configuration files.

drwx------   - root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002
-rw-r--r--   1 root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_STARTED
-rw-r--r--   1 root supergroup          0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_SUCCESS
-rw-r--r--  10 root supergroup     112872 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.jar
-rw-r--r--  10 root supergroup       6641 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.split
-rw-r--r--   1 root supergroup        797 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.splitmetainfo
-rw-r--r--   1 root supergroup      88675 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.xml
-rw-r--r--   1 root supergroup     439848 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1.jhist
-rw-r--r--   1 root supergroup     105176 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1_conf.xml

Where is the output saved? I am talking about the output that it is stored as it is being produced by the tasks, and not the final output that comes when all map or reduce tasks finish.


Solution

  • The output put of a task is in <output dir>/_temporary/1/_temporary.