Tags: python, hadoop, mapreduce, hadoop-streaming

Hadoop 2.7: MapReduce task's total time using streaming API


I am running Hadoop 2.7.1 on a local cluster (all nodes running Ubuntu 14.x or above). My MapReduce programs are written in Python, and I use the streaming API to run them. I want to find the total time taken by all the map and reduce tasks across all the nodes. How can I do that? I am not able to find the job files (perhaps they were removed from Hadoop 2.x onwards).


Solution

  • If you're looking for the total time spent across all your tasks, you'll likely want to look at the job counters. You can view them on the job history server by drilling into an individual job and clicking Counters on the left. Alternatively, you can retrieve them programmatically with the mapred job commands. For example, to print the summary status of every SUCCEEDED job:

    mapred job -list all | grep SUCCEEDED | awk '{ print $1 }' | \
        xargs -n 1 mapred job -status
    

    The closest thing to "aggregate wall time" consumed on your cluster is the time spent in occupied slots, which is reported in milliseconds by the SLOTS_MILLIS_MAPS and SLOTS_MILLIS_REDUCES counters:

    # -I {} substitutes each job ID into the command (GNU xargs deprecates -i and warns when it is combined with -n)
    mapred job -list all | grep SUCCEEDED | awk '{ print $1 }' | \
        xargs -I {} mapred job -counter {} org.apache.hadoop.mapreduce.JobCounter SLOTS_MILLIS_MAPS
    mapred job -list all | grep SUCCEEDED | awk '{ print $1 }' | \
        xargs -I {} mapred job -counter {} org.apache.hadoop.mapreduce.JobCounter SLOTS_MILLIS_REDUCES
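
The commands above print one counter value per job, so you still have to add them up. Since your jobs are in Python anyway, here is a minimal sketch of doing the whole thing from a script. It assumes that `mapred` is on the PATH, that the job ID is the first column of `mapred job -list all`, and that `mapred job -counter` prints the counter value on its own line on stdout. The function names (`succeeded_job_ids`, `parse_counter`, `format_millis`, `total_slot_millis`) are made up for this example and are not part of Hadoop.

```python
import subprocess

# Counter group and names taken from the shell commands above.
COUNTER_GROUP = "org.apache.hadoop.mapreduce.JobCounter"
COUNTERS = ("SLOTS_MILLIS_MAPS", "SLOTS_MILLIS_REDUCES")


def succeeded_job_ids(list_output):
    """Return the first column of every row that has a SUCCEEDED field.

    Assumption: rows look like "job_... SUCCEEDED ...", as the
    grep/awk pipeline above also assumes.
    """
    ids = []
    for line in list_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("job_") and "SUCCEEDED" in fields:
            ids.append(fields[0])
    return ids


def parse_counter(counter_output):
    """Return the last all-digit line of the command output, or 0 if none."""
    for line in reversed(counter_output.splitlines()):
        line = line.strip()
        if line.isdigit():
            return int(line)
    return 0


def format_millis(ms):
    """Render milliseconds as hours, minutes and seconds."""
    hours, rem = divmod(ms // 1000, 3600)
    minutes, seconds = divmod(rem, 60)
    return "%dh %02dm %02ds" % (hours, minutes, seconds)


def total_slot_millis(run=subprocess.check_output):
    """Sum both slot counters over all SUCCEEDED jobs.

    `run` is injectable so the logic can be tried without a cluster.
    """
    listing = run(["mapred", "job", "-list", "all"]).decode()
    total = 0
    for job_id in succeeded_job_ids(listing):
        for name in COUNTERS:
            out = run(["mapred", "job", "-counter", job_id, COUNTER_GROUP, name])
            total += parse_counter(out.decode())
    return total
```

On a node with the Hadoop client configured, `print(format_millis(total_slot_millis()))` should give the combined map and reduce slot time. If your `mapred job -list all` output has a different layout, adjust `succeeded_job_ids` accordingly.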