
Trouble running Apache Giraph on YARN cluster (Hadoop 2.5.2)

I'm trying to run the basic ShortestPaths example using Giraph 1.1 on Hadoop 2.5.2. I'm running in actual cluster mode (i.e., not pseudo-distributed), and standard MapReduce jobs run OK. But when I try to run the Giraph example, it seems to hang unless I set

-ca giraph.SplitMasterWorker=false

and correspondingly set the number of workers to 1. But this kinda defeats the point of running on a cluster, no? OTOH, if I run without disabling SplitMasterWorker, I get this error:

When using LocalJobRunner, you cannot run in split master / worker mode 
since there is only 1 task at a time!

which suggests that Giraph is defaulting to local mode. One report I read suggested fixing this by adding

-ca mapred.job.tracker=

to the Giraph command line, but on Hadoop 2.5.2 with YARN, there is no JobTracker on port 5431, if I understand correctly. Anyway, if I do add that bit, the job tries to run but seems to hang without ever finishing. Here's the complete command line, and the job output follows:

[prhodes@ip-10-0-0-12 conf]$ hadoop jar /home/prhodes/giraph/giraph-
dependencies.jar org.apache.giraph.GiraphRunner 
org.apache.giraph.examples.SimpleShortestPathsComputation -vif 
-vip /user/prhodes/input/tiny_graph.txt -vof -op 
/user/prhodes/giraph_output/shortestpaths -w 3 -ca 

15/03/10 03:18:59 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
15/03/10 03:19:02 INFO server.NIOServerCnxnFactory: binding to port
15/03/10 03:19:02 INFO server.PrepRequestProcessor: zookeeper.skipACL=="yes", ACL checks will be skipped
15/03/10 03:19:05 INFO zk.ZooKeeperManager: onlineZooKeeperServers: Connect attempt 1 of 10 max trying to connect to ip-10-0-0-12.ec2.internal:22181 with poll msecs = 3000
15/03/10 03:19:05 INFO zk.ZooKeeperManager: onlineZooKeeperServers: Connected to ip-10-0-0-12.ec2.internal/!
15/03/10 03:19:05 INFO zk.ZooKeeperManager: onlineZooKeeperServers: Creating my filestamp _bsp/_defaultZkManagerDir/job_local1346154675_0001/_zkServer/ip-10-0-0-12.ec2.internal 0
15/03/10 03:19:05 INFO server.NIOServerCnxnFactory: Accepted socket connection from /
15/03/10 03:19:05 INFO graph.GraphTaskManager: setup: Chosen to run ZooKeeper...
15/03/10 03:19:05 INFO graph.GraphTaskManager: setup: Starting up BspServiceMaster (master thread)...
15/03/10 03:19:05 INFO bsp.BspService: BspService: Path to create to halt is /_hadoopBsp/job_local1346154675_0001/_haltComputation
15/03/10 03:19:05 INFO bsp.BspService: BspService: Connecting to ZooKeeper with job job_local1346154675_0001, 0 on ip-10-0-0-12.ec2.internal:22181
15/03/10 03:19:05 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-10-0-0-12.ec2.internal/ Will not attempt to authenticate using SASL (unknown error)
15/03/10 03:19:05 INFO server.NIOServerCnxnFactory: Accepted socket connection from /
15/03/10 03:19:05 INFO zookeeper.ClientCnxn: Socket connection established to ip-10-0-0-12.ec2.internal/, initiating session
15/03/10 03:19:05 INFO server.ZooKeeperServer: Client attempting to establish new session at /
15/03/10 03:19:05 INFO persistence.FileTxnLog: Creating new log file: log.1
15/03/10 03:19:05 INFO server.ZooKeeperServer: Established session 0x14c01b158f00000 with negotiated timeout 600000 for client /
15/03/10 03:19:05 INFO zookeeper.ClientCnxn: Session establishment complete on server ip-10-0-0-12.ec2.internal/, sessionid = 0x14c01b158f00000, negotiated timeout = 600000
15/03/10 03:19:05 INFO bsp.BspService: process: Asynchronous connection complete.
15/03/10 03:19:05 INFO graph.GraphTaskManager: map: No need to do anything when not a worker
15/03/10 03:19:05 INFO graph.GraphTaskManager: cleanup: Starting for MASTER_ZOOKEEPER_ONLY
15/03/10 03:19:05 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14c01b158f00000 type:create cxid:0x1 zxid:0x2 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_local1346154675_0001/_masterElectionDir Error:KeeperErrorCode = NoNode for /_hadoopBsp/job_local1346154675_0001/_masterElectionDir
15/03/10 03:19:05 INFO master.BspServiceMaster: becomeMaster: First child is '/_hadoopBsp/job_local1346154675_0001/_masterElectionDir/ip-10-0-0-12.ec2.internal_00000000000' and my bid is '/_hadoopBsp/job_local1346154675_0001/_masterElectionDir/ip-10-0-0-12.ec2.internal_00000000000'
15/03/10 03:19:05 INFO netty.NettyServer: NettyServer: Using execution group with 8 threads for requestFrameDecoder.
15/03/10 03:19:05 INFO Configuration.deprecation: is deprecated. Instead, use mapreduce.job.maps
15/03/10 03:19:05 INFO netty.NettyServer: start: Started server communication server: ip-10-0-0-12.ec2.internal/ with up to 16 threads on bind attempt 0 with sendBufferSize = 32768 receiveBufferSize = 524288
15/03/10 03:19:05 INFO netty.NettyClient: NettyClient: Using execution handler with 8 threads after request-encoder.
15/03/10 03:19:05 INFO master.BspServiceMaster: becomeMaster: I am now the master!
15/03/10 03:19:05 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14c01b158f00000 type:create cxid:0xe zxid:0x9 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0 Error:KeeperErrorCode = NoNode for /_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0
15/03/10 03:19:05 INFO bsp.BspService: process: applicationAttemptChanged signaled
15/03/10 03:19:05 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14c01b158f00000 type:create cxid:0x16 zxid:0xc txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir/-1 Error:KeeperErrorCode = NoNode for /_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir/-1
15/03/10 03:19:05 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir, type=NodeChildrenChanged, state=SyncConnected)
15/03/10 03:19:07 INFO mapred.LocalJobRunner: MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 3 needed to start superstep -1 > map
15/03/10 03:19:07 INFO job.HaltApplicationUtils$DefaultHaltInstructionsWriter: writeHaltInstructions: To halt after next superstep execute: 'bin/halt-application --zkServer ip-10-0-0-12.ec2.internal:22181 --zkNode /_hadoopBsp/job_local1346154675_0001/_haltComputation'
15/03/10 03:19:07 INFO mapreduce.Job: Running job: job_local1346154675_0001
15/03/10 03:19:08 INFO mapreduce.Job: Job job_local1346154675_0001 running in uber mode : false
15/03/10 03:19:08 INFO mapreduce.Job:  map 25% reduce 0%
15/03/10 03:19:10 INFO mapred.LocalJobRunner: MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 3 needed to start superstep -1 > map
15/03/10 03:19:19 INFO mapred.LocalJobRunner: MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 3 needed to start superstep -1 > map
15/03/10 03:19:28 INFO mapred.LocalJobRunner: MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 3 needed to start superstep -1 > map
15/03/10 03:19:35 INFO master.BspServiceMaster: checkWorkers: Only found 0 responses of 3 needed to start superstep -1.  Reporting every 30000 msecs, 569976 more msecs left before giving up.
15/03/10 03:19:35 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14c01b158f00000 type:create cxid:0x22 zxid:0x10 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir
15/03/10 03:19:35 INFO server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14c01b158f00000 type:create cxid:0x23 zxid:0x11 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_local1346154675_0001/_applicationAttemptsDir/0/_superstepDir/-1/_workerUnhealthyDir
15/03/10 03:19:40 INFO mapred.LocalJobRunner: MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 3 needed to start superstep -1 > map


  • OK, this turned out to be fairly simple. I had built Giraph using the hadoop_2 profile instead of hadoop_yarn. After rebuilding with the YARN profile, the problem no longer happens. I don't understand the entire mechanism, but apparently building with that profile changes some defaults so that Giraph runs in pure YARN mode at runtime.

    So, if you get this, rebuild using

    mvn -Phadoop_yarn clean package

    and that will probably fix it.
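
To check whether a run fell back to local mode, as it did in the log above, you can look for the two markers that log shows: job IDs beginning with job_local and lines logged by mapred.LocalJobRunner. The following is only a minimal sketch, assuming the job output has been captured as text. The function name `local_mode_markers` and the marker labels are made up for illustration and are not part of Giraph or Hadoop.

```python
import re

# Markers taken from the log in the question: local-mode job IDs start with
# "job_local", and progress lines are logged by mapred.LocalJobRunner.
LOCAL_JOB_ID = re.compile(r"\bjob_local\d+_\d+\b")
LOCAL_RUNNER = "mapred.LocalJobRunner"


def local_mode_markers(log_text):
    """Return the set of local-mode markers found in captured job output.

    Hypothetical helper for illustration only; the labels are arbitrary.
    """
    found = set()
    for line in log_text.splitlines():
        if LOCAL_JOB_ID.search(line):
            found.add("job_local id")
        if LOCAL_RUNNER in line:
            found.add("LocalJobRunner")
    return found


# Two lines copied from the job output in the question.
sample = """\
15/03/10 03:19:07 INFO mapreduce.Job: Running job: job_local1346154675_0001
15/03/10 03:19:10 INFO mapred.LocalJobRunner: MASTER_ZOOKEEPER_ONLY checkWorkers: Only found 0 responses of 3 needed to start superstep -1 > map
"""

markers = local_mode_markers(sample)
if markers:
    print("Job appears to have run in local mode; markers:", sorted(markers))
else:
    print("No local-mode markers found")
```

If either marker shows up after rebuilding, the job is probably still being submitted through the local runner rather than to the cluster.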