Tags: java, apache, hadoop, mapreduce, hadoop-yarn

hadoop mapreduce teragen FAIL_CONTAINER_CLEANUP


I am having some trouble with my Hadoop cluster. I tried to run some benchmarks to check its performance and see whether MapReduce works fine, but I am seeing some strange behaviour. MapReduce starts and works through its map phase, but I get errors from it. I used teragen to create the data first:

$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 500 random-data
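
For reference, teragen's first argument is the number of rows to generate and the second is the HDFS output directory; if I remember correctly, the number of map tasks can also be set explicitly via mapreduce.job.maps, e.g.:

$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen -Dmapreduce.job.maps=2 500 random-data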

The job then starts, and I get failures without the process stopping:

17/02/23 12:29:27 INFO client.RMProxy: Connecting to ResourceManager at /172.16.138.145:8032

17/02/23 12:29:28 INFO terasort.TeraSort: Generating 500 using 2

17/02/23 12:29:28 INFO mapreduce.JobSubmitter: number of splits:2

17/02/23 12:29:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1487846108320_0007

17/02/23 12:29:28 INFO impl.YarnClientImpl: Submitted application application_1487846108320_0007

17/02/23 12:29:28 INFO mapreduce.Job: The url to track the job: http://172.16.138.145:8088/proxy/application_1487846108320_0007/

17/02/23 12:29:28 INFO mapreduce.Job: Running job: job_1487846108320_0007

17/02/23 12:29:34 INFO mapreduce.Job: Job job_1487846108320_0007 running in uber mode : false

17/02/23 12:29:34 INFO mapreduce.Job: map 0% reduce 0%

17/02/23 12:29:47 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000001_0, Status : FAILED

17/02/23 12:29:48 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000000_0, Status : FAILED

17/02/23 12:30:02 INFO mapreduce.Job: map 50% reduce 0%

17/02/23 12:30:02 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000001_1, Status : FAILED

17/02/23 12:30:03 INFO mapreduce.Job: map 0% reduce 0%

17/02/23 12:30:03 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000000_1, Status : FAILED

17/02/23 12:30:15 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000001_2, Status : FAILED

17/02/23 12:30:16 INFO mapreduce.Job: Task Id : attempt_1487846108320_0007_m_000000_2, Status : FAILED

17/02/23 12:30:30 INFO mapreduce.Job: map 100% reduce 0%

17/02/23 12:30:31 INFO mapreduce.Job: Job job_1487846108320_0007 failed with state FAILED due to: Task failed task_1487846108320_0007_m_000001

Job failed as tasks failed. failedMaps:1 failedReduces:0

I checked the logs on the datanode concerned and found the following lines repeated for each failure:

2017-02-23 11:36:12,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1487846108320_0001_m_000001_1 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP

2017-02-23 11:36:12,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1487846108320_0001_m_000001_1:

2017-02-23 11:36:12,902 INFO [ContainerLauncher #5] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1487846108320_0001_01_000004 taskAttempt attempt_1487846108320_0001_m_000001_1

2017-02-23 11:36:12,903 INFO [ContainerLauncher #5] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1487846108320_0001_m_000001_1

2017-02-23 11:36:12,903 INFO [ContainerLauncher #5] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : Datanode3:34121

2017-02-23 11:36:12,923 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1487846108320_0001_m_000001_1 TaskAttempt Transitioned from FAIL_CONTAINER_CLEANUP to FAIL_TASK_CLEANUP

2017-02-23 11:36:12,924 INFO [CommitterEvent Processor #2] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: TASK_ABORT

2017-02-23 11:36:12,932 WARN [CommitterEvent Processor #2] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete hdfs://172.16.138.145:9000/user/hdfs/random-dataSmallV7.7/_temporary/1/_temporary/attempt_1487846108320_0001_m_000001_1

2017-02-23 11:36:12,932 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1487846108320_0001_m_000001_1 TaskAttempt Transitioned from FAIL_TASK_CLEANUP to FAILED
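
In case it helps, the full container logs can also be pulled with the yarn CLI (assuming log aggregation is enabled), for example:

$ yarn logs -applicationId application_1487846108320_0007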

In this case the job failed, but sometimes I get the error and the job still succeeds (rarely). Do you know what could cause this FAIL_CONTAINER_CLEANUP, or what the potential causes of this problem might be? Here only mappers are used and no reducer is involved, but the error also happens in other cases where reducers are involved.
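
To illustrate the reducer case (just as an example, using the same examples jar), a reducer-involving job over the data generated above would be the terasort step:

$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort random-data sorted-data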

Thank you in advance for your ideas.


Solution

  • I finally solved it. In some of my /etc/hosts files I had a line mapping my node's hostname to the loopback address: 127.0.1.1 Datanode1

    I replaced that line with the node's actual IP address: 172.16.138.147 Datanode1

    This allowed Hadoop to resolve my server correctly and fixed the error (a quick check is sketched below).

    I hope this will help others.
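
    As a rough check (assuming getent and the yarn CLI are available on every node), the hostname should now resolve to the routable address and the NodeManagers should register with it:

    $ getent hosts Datanode1   # should print 172.16.138.147, not 127.0.1.1
    $ yarn node -list          # NodeManagers should now report under their real addresses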