amazon-web-services, hadoop, mapreduce, amazon-emr, elastic-map-reduce

Amazon EMR MapReduce progress rollback?


Hi, I've just run into a strange problem:

I am running a Java MapReduce job on EMR.

The input data is about 1 TB, and I used 1 master + 8 slave nodes.

All of the instances are r2.2xlarge.

Initially, everything looks fine:

INFO mapreduce.Job:  map 0% reduce 0%
INFO mapreduce.Job:  map 1% reduce 0%
INFO mapreduce.Job:  map 2% reduce 0%
INFO mapreduce.Job:  map 3% reduce 0%
INFO mapreduce.Job:  map 4% reduce 0%
INFO mapreduce.Job:  map 5% reduce 0%
INFO mapreduce.Job:  map 6% reduce 0%
INFO mapreduce.Job:  map 7% reduce 0%

...

However, I then noticed that the progress started rolling back (falling from around 7% down to 1%):

INFO mapreduce.Job:  map 4% reduce 0%
INFO mapreduce.Job:  map 5% reduce 0%
INFO mapreduce.Job:  map 6% reduce 0%
INFO mapreduce.Job:  map 7% reduce 0%
INFO mapreduce.Job:  map 6% reduce 0%
INFO mapreduce.Job:  map 5% reduce 0%
INFO mapreduce.Job:  map 4% reduce 0%
INFO mapreduce.Job:  map 3% reduce 0%

....

When I test with about 3 GB of data, the result is correct, the job runs smoothly, and this situation never shows up.

Could anyone tell me why this happens?

Best.


Solution

  • The displayed job progress is the aggregate status of the job's completed and in-progress tasks, as reported by the NodeManagers.

    A reversal of job progress suggests that either a NodeManager has crashed or it is sending heartbeats and task statuses very infrequently to the ResourceManager. In both cases, the RM treats it as an NM failure and nullifies all the task progress that NM reported for the still-incomplete job. The tasks that had completed successfully on that node, as well as the ones that were running before the crash, have to be rerun by the ApplicationMaster. Thus, the contribution the failed NM made towards the progress of the job becomes invalid and the job progress gets recalculated.

    Here, the large input volume might be causing OOM errors or task timeouts. By default, mapreduce.task.timeout is 600000 ms (10 minutes). If a task does not show any progress within the timeout period, it fails. Multiple task failures on a node (3 by default) for a single job blacklist that NM for the job, and the progress gets recalculated. The NodeManager logs would provide more clarity.
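
    If the tasks are timing out rather than crashing, there are two common mitigations: raise mapreduce.task.timeout in the driver, and call context.progress() from the mapper so long-running records keep the task attempt alive. Below is a minimal sketch of both (SlowRecordMapper and the 20-minute value are hypothetical, chosen only for illustration):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SlowRecordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... expensive per-record work goes here ...

            // Report liveness so the framework does not kill the attempt
            // after mapreduce.task.timeout of apparent inactivity.
            context.progress();
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Raise the task timeout from the default 600000 ms (10 minutes)
            // to 20 minutes for legitimately slow tasks.
            conf.setLong("mapreduce.task.timeout", 1200000L);

            Job job = Job.getInstance(conf, "slow-job");
            job.setJarByClass(SlowRecordMapper.class);
            job.setMapperClass(SlowRecordMapper.class);
            // ... set input/output formats and paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

    If the tasks are instead failing with OOM, raising the timeout will not help; the container memory settings (e.g. mapreduce.map.memory.mb) are the place to look, and the NodeManager logs should show which case you are in.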