Tags: amazon-web-services, amazon-ec2, netty, apache-zookeeper, giraph

Giraph application gets stuck on superstep 4, all workers active but making no progress


I'm running a BFS over the Spanish-edition Wikipedia. I converted the dump (https://dumps.wikimedia.org/eswiki/20160601) into a file that Giraph can read.

The BFS searches for paths, and everything runs fine until it gets stuck at some point during superstep 4.

I'm using a 5-node cluster on AWS (4 core/slave nodes, 1 master). Each node is an r3.8xlarge EC2 instance. The command I use to launch the BFS is this one:

/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 4 -yh 120000 -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.logLevel=Debug

Each container has almost 120 GB. I'm using a limit of 1000M messages for the out-of-core messages option, because I believed the messages were the problem, but apparently they are not.

These are the master logs (it seems to be waiting for the workers to finish, but they never do, and it stays like this forever):

16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
...the same last two lines repeat about thirty times...
...
16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4

And on all workers there is no information about what is happening (I'm testing with giraph.logLevel=Debug because with Giraph's default log level I was lost), and the workers print this over and over again:

16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@7392f34d
16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82

Before superstep 4 started, each worker logged the following:

16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2] startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep: addressesAndPartitions[Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)]
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 2 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 3 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 4 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 5 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 6 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 7 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 8 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 9 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 10 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 11 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 12 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 13 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 14 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 15 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG graph.GraphTaskManager: execute: Memory (free/total/max) = 92421.41M / 115000.00M / 115000.00M

I don't know what exactly is failing:

  • I know that all containers have memory available; on the datanodes I checked that each one had around 50 GB free.
  • I'm not sure whether I'm hitting some limit in the out-of-core message handling. I know that writing messages too fast is dangerous with Giraph 1.1, but if I hit that limit I would expect the container to fail, right?
  • Maybe the ZooKeeper client connections aren't enough? I read that ZooKeeper's default maxClientCnxns of 60 may be too small in a context like AWS, but I don't understand the relationship between Giraph and ZooKeeper well enough to start changing default configuration values.
  • Maybe I have to tune the out-of-core configuration, using giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true as someone recommends here (http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3CCC775449.2C4B%[email protected]%3E)? (See the configuration sketch after this list.)
  • Should I tune the Netty configuration? I'm running with the defaults, but I believe that only 8 Netty client threads and 8 server threads might be enough, since I have only a few workers; maybe too many Netty threads are creating the overhead that keeps the whole application stuck.
  • Using giraph.useBigDataIOForMessages=true didn't help either. I know that each vertex is receiving 100 M or more messages and that property should help with that, but it didn't make any difference.
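
For reference, this is roughly what those options would look like if set through Giraph's GiraphConfiguration instead of -ca. It is only a sketch: the Netty thread property names (giraph.nettyClientThreads, giraph.nettyServerThreads) and all concrete values here are my assumptions, not verified recommendations.

import org.apache.giraph.conf.GiraphConfiguration;

// Sketch only: the options discussed above, set in code rather than via -ca.
public class GiraphTuningSketch {
    public static GiraphConfiguration buildConf() {
        GiraphConfiguration conf = new GiraphConfiguration();

        // Out-of-core messaging, as in the original launch command.
        conf.setBoolean("giraph.useOutOfCoreMessages", true);
        conf.setInt("giraph.maxMessagesInMemory", 1000000000);

        // Back-pressure options suggested in the mailing-list thread (values are guesses).
        conf.setInt("giraph.maxNumberOfOpenRequests", 2000);
        conf.setBoolean("giraph.waitForRequestsConfirmation", true);

        // Assumed Netty thread properties: 8 client / 8 server threads for 4 workers.
        conf.setInt("giraph.nettyClientThreads", 8);
        conf.setInt("giraph.nettyServerThreads", 8);

        // For vertices that receive very large numbers of messages.
        conf.setBoolean("giraph.useBigDataIOForMessages", true);
        return conf;
    }
}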

As you may suspect, I have too many hypotheses; that's why I'm asking for help, so I can move in the right direction.

EDIT 1: Using giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true didn't solve the problem.

EDIT 2: I doubled the Netty threads and doubled the Netty buffer sizes; no change.

EDIT 3: I condensed the messages, 1000 into 1, which produced far fewer messages, but still the same final result (a simplified sketch of what I mean is below).
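
By "condensed" I mean batching on the sending side: buffering the outgoing path strings per target vertex and sending one joined message instead of many small ones. This is a minimal sketch of the idea, not the actual job code; Text stands in for my real ID and message types, and the tab separator is arbitrary.

// Sketch only: send one batched message per target instead of one message per path.
// This method would live inside the computation class (which provides sendMessage()),
// with these imports at the top of the file:
//   import java.util.List;
//   import java.util.Map;
//   import org.apache.hadoop.io.Text;
private void sendBatched(Map<Text, List<String>> pathsByTarget) {
    for (Map.Entry<Text, List<String>> entry : pathsByTarget.entrySet()) {
        StringBuilder batch = new StringBuilder();
        for (String path : entry.getValue()) {
            if (batch.length() > 0) {
                batch.append('\t'); // separator; the receiving vertex splits on it
            }
            batch.append(path);
        }
        sendMessage(entry.getKey(), new Text(batch.toString()));
    }
}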


Solution

  • I isolated the problem by running with 15 compute threads and 240 partitions.

    With that setup I could see that one partition, on one thread, was taking much too long.

    So I checked the code, searching for something that could take that long, and I found it: I was concatenating strings with + instead of StringBuilder (I believed the compiler would switch the "+" for StringBuilder.append(), as someone suggested here, but this wasn't the case). After switching to StringBuilder, the time decreased dramatically and the application finally finishes. A simplified sketch of the change is below.
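
    To illustrate the kind of change, here is a simplified sketch (not the actual job code; paths is just an illustrative collection of path strings). The compiler does turn a single + expression into StringBuilder.append() calls, but inside a loop it creates a fresh StringBuilder on every iteration and copies the whole accumulated string each time, which is quadratic; one StringBuilder reused across the loop keeps it linear.

    // Before: each iteration copies the entire accumulated string, O(n^2) overall.
    String slow = "";
    for (String path : paths) {
        slow = slow + path + "\t";
    }

    // After: a single StringBuilder reused across the loop, O(n) overall.
    StringBuilder builder = new StringBuilder();
    for (String path : paths) {
        builder.append(path).append('\t');
    }
    String fast = builder.toString();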