Tags: java, hadoop, mapreduce, hdfs, mahout

Hadoop: All datanodes 127.0.0.1:50010 are bad. Aborting


I'm running an example from Apache Mahout 0.9 (org.apache.mahout.classifier.df.mapreduce.BuildForest) using the PartialBuilder implementation on Hadoop, but I'm getting an error no matter what I try.
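
For context, I'm launching the job roughly like this (the jar name, HDFS paths, and parameter values below are placeholders rather than my exact command; the flags are the same ones used in the Mahout partial-implementation example):

    # Approximate invocation -- jar name, paths, and parameter values are placeholders.
    #   -d  training data    -ds dataset descriptor   -sl features tried per tree node
    #   -p  use the partial (MapReduce) builder    -t number of trees    -o output path
    hadoop jar mahout-core-0.9-job.jar \
      org.apache.mahout.classifier.df.mapreduce.BuildForest \
      -d /data/train.arff -ds /data/train.info \
      -sl 5 -p -t 100 -o /data/forest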

The error is:

14/12/10 10:58:36 INFO mapred.JobClient: Running job: job_201412091528_0004
14/12/10 10:58:37 INFO mapred.JobClient:  map 0% reduce 0%
14/12/10 10:58:50 INFO mapred.JobClient:  map 10% reduce 0%
14/12/10 10:59:44 INFO mapred.JobClient:  map 20% reduce 0%
14/12/10 11:32:23 INFO mapred.JobClient: Task Id : attempt_201412091528_0004_m_000000_0, Status : FAILED
java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...
  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3290)
  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2783)
  at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987)

There are no glaring errors in the datanode log file:

2014-12-10 11:32:19,157 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:50010, dest: /127.0.0.1:62662, bytes: 549024, op: HDFS_READ, cliID: DFSClient_attempt_201412091528_0004_m_000000_0_1249767243_1, offset: 0, srvID: DS-957555695-10.0.1.9-50010-1418164764736, blockid: blk_-7548306051547444322_12804, duration: 2012297405000
2014-12-10 11:32:25,511 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:50010, dest: /127.0.0.1:64109, bytes: 1329, op: HDFS_READ, cliID: DFSClient_attempt_201412091528_0004_m_000000_1_-1362169490_1, offset: 0, srvID: DS-957555695-10.0.1.9-50010-1418164764736, blockid: blk_4126779969492093101_12817, duration: 285000
2014-12-10 11:32:29,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:50010, dest: /127.0.0.1:64110, bytes: 67633152, op: HDFS_READ, cliID: DFSClient_attempt_201412091528_0004_m_000000_1_-1362169490_1, offset: 0, srvID: DS-9575556

... or in the namenode log. The jobtracker log just repeats the error found in the datanode log. The one error that precedes the failure by several minutes is an EOFException, which may or may not be a problem for PartialBuilder:

2014-12-10 12:12:22,060 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-957555695-10.0.1.9-50010-1418164764736, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
  at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:296)
  at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:340)
  at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:404)
  at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:582)
  at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:404)
  at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112)
  at java.lang.Thread.run(Thread.java:695)

I am able to read and write files to HDFS directly, and I can even run this job on a small subset of the data, but I can't get the Map/Reduce job to complete on the full dataset. Any idea what I'm doing wrong?

Notes about my setup:

  • Everything is running locally
  • Apache Mahout version 0.9
  • Hadoop version 1.2.1
  • Java version 1.6.0_65

hdfs-site.xml settings:

  • dfs.replication = 4
  • dfs.permissions = false
  • dfs.name.dir = /Users/juliuss/Dev/hdfs/name/
  • dfs.data.dir = /Users/juliuss/Dev/hdfs/data/
  • mapred.child.java.opts = -Xmx4096m
  • dfs.datanode.socket.write.timeout = 0

Solution

  • After messing with a million settings, none of which worked, I finally resolved this issue by dramatically reducing the maximum split size (full command sketched at the end of this answer):

    -Dmapred.max.split.size=16777216
    

    This increased the number of mappers from 10 to 40 for this dataset (consistent with the previous splits being full 64 MB default-sized blocks, now capped at 16 MB), which allowed them to complete correctly. Now that I've isolated the issue, I'm going to steadily increase the split size to find the right number. (For Random Forests you should use the largest split size that still completes, because with PartialBuilder each mapper grows its trees from only the data in its own split.)

    Unfortunately, I don't know why the split size was causing the "All datanodes are bad. Aborting" error, as it's not the error I would expect.
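
    In case it helps anyone else: the -D flag is a generic Hadoop option, so it has to appear before BuildForest's own arguments to be picked up. A sketch of the full invocation (jar name, paths, and the other parameter values are placeholders):

        # -Dmapred.max.split.size must precede the tool-specific flags.
        # Jar name, paths, and parameter values are placeholders.
        hadoop jar mahout-core-0.9-job.jar \
          org.apache.mahout.classifier.df.mapreduce.BuildForest \
          -Dmapred.max.split.size=16777216 \
          -d /data/train.arff -ds /data/train.info \
          -sl 5 -p -t 100 -o /data/forest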