Tags: hadoop, hadoop-yarn, cloudera-quickstart-vm

Debug Apache Slider package?


I went through the Slider Memcached tutorial and was able to package, deploy, and start the memcached container successfully. However, when I package a custom application (basically a Java jar plus dependencies), the container never launches successfully.

The application page shows the app in a FINISHED/FAILED state with this diagnostic: http://quickstart.cloudera:8088/cluster/app/application_1439926335194_0001

Diagnostics: Unstable Application Instance : - failed with component MYAPP failed 'recently' 6 times (4 in startup); threshold is 5 - last failure: Failure container_1439926335194_0001_01_000008 on host quickstart.cloudera (0): http://quickstart.cloudera:19888/jobhistory/logs//quickstart.cloudera:8041/container_1439926335194_0001_01_000008/ctx/MYUSER

Part of the challenge in diagnosing the container failure is that the logs disappear once the application completes: http://quickstart.cloudera:8042/node/containerlogs/container_1439926335194_0001_01_000001/MYUSER
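Because the logs are short-lived, it can help to build the log URL up front so it is ready to open while the container is still running. The following is a minimal sketch, assuming the container ID layout seen in the diagnostics above (`container_<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>`) and the NodeManager log URL pattern shown in this question. The function names are hypothetical, and other YARN versions may use a different ID format.

```python
import re

# ID layout as it appears in this question's diagnostics (an assumption for
# other YARN versions): container_<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>
CONTAINER_ID = re.compile(r"^container_(\d+)_(\d+)_(\d+)_(\d+)$")


def application_id(container_id):
    """Derive the application ID that a container ID belongs to."""
    match = CONTAINER_ID.match(container_id)
    if match is None:
        raise ValueError("unrecognized container ID: %r" % container_id)
    cluster_ts, app_seq = match.group(1), match.group(2)
    return "application_%s_%s" % (cluster_ts, app_seq)


def container_log_url(nm_http_address, container_id, user):
    """Build the NodeManager log URL, following the pattern seen in the question."""
    application_id(container_id)  # raises ValueError on a malformed ID
    return "http://%s/node/containerlogs/%s/%s" % (nm_http_address, container_id, user)


if __name__ == "__main__":
    cid = "container_1439926335194_0001_01_000008"
    print(application_id(cid))
    print(container_log_url("quickstart.cloudera:8042", cid, "MYUSER"))
```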

There is a troubleshooting page for Slider that describes how to keep the logs after the application completes: http://slider.incubator.apache.org/docs/troubleshooting.html

Configuring YARN for better debugging

One configuration to aid debugging is to tell the nodemanagers to keep data for a short period after containers finish:

<!-- 10 minutes after a failure to see what is left in the directory-->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>

I found this setting under YARN - Configuration - NodeManager Base Group - Advanced - Localized Dir Deletion Delay and changed it from the default of 0 to 1200. However, even after deploying the client configuration, restarting the NodeManager and YARN, and even restarting the VM, the logs are still deleted when the container completes.
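One way to narrow this down is to check whether the new value actually reached a given yarn-site.xml. The sketch below reads a property from Hadoop-style configuration XML. The helper names are hypothetical, and which file the NodeManager process really reads on this VM is an assumption that needs to be verified separately.

```python
import xml.etree.ElementTree as ET


def hadoop_property(xml_text, name):
    """Return the value of a <property> in Hadoop-style XML, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def read_site_property(path, name):
    """Read one property from a configuration file on disk."""
    with open(path) as handle:
        return hadoop_property(handle.read(), name)


if __name__ == "__main__":
    sample = """
    <configuration>
      <property>
        <name>yarn.nodemanager.delete.debug-delay-sec</name>
        <value>600</value>
      </property>
    </configuration>
    """
    print(hadoop_property(sample, "yarn.nodemanager.delete.debug-delay-sec"))
```

If `read_site_property` returns `None` or `0` for the file the NodeManager uses, the change has not taken effect there, which would explain the logs still being deleted.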

I'm working on the CDH 5.3.0 VirtualBox VM image, and the cluster and services appear to be working normally when I start the package.

EDIT:

The only error I see in the log is this:

Role instance RoleInstance failed

2015-08-19 10:59:21,819 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster - Role instance RoleInstance{role='SIMHASH', id='container_1439926335194_0002_01_000003', container=ContainerID=container_1439926335194_0002_01_000003 nodeID=quickstart.cloudera:8041 http=quickstart.cloudera:8042 priority=1073741825 resource=, createTime=1440007115649, startTime=1440007115674, released=false, roleId=1, host=quickstart.cloudera, hostURL=http://quickstart.cloudera:8042, state=5, placement=null, exitCode=0, command='python ./infra/agent/slider-agent/agent/main.py --label container_1439926335194_0002_01_000003___SIMHASH --zk-quorum localhost:2181 --zk-reg-path /registry/users/c4/services/org-apache-slider/simhash1 > /slider-agent.out 2>&1 ; ', diagnostics='', output=null, environment=[LANGUAGE="en_US.UTF-8", AGENT_WORK_ROOT="$PWD", HADOOP_USER_NAME="C4", AGENT_LOG_ROOT="", PYTHONPATH="./infra/agent/slider-agent/", LC_ALL="en_US.UTF-8", SLIDER_PASSPHRASE="8R9ZPw3aZ20GFydi3OqvEtwYhh1qzfQBmWv6BjXepg3PCcyS8m", LANG="en_US.UTF-8"]} failed


Solution

  • Short Answer

    Look at the container logs to get the output from the running application.

    Details:

    I found the container logs via the containers web UI (on the Cloudera VM it is http://quickstart.cloudera:8042/node/allContainers).

    There are two containers for my application. The first only shows the logs I was looking at earlier, which indicate whether the container succeeded or failed. The second has many logs with useful information (command / errors / slider-agent / status_command).

    The logs are transient, but I was able to look at them before the application terminated.

    slider-agent.out just has this line in it:

    No handlers could be found for logger "root"

    However, slider-agent.log gave me the information I was looking for: the stderr and stdout from executing the Java command line, which is very helpful.

    INFO 2015-08-19 14:07:28,422 AgentToggleLogger.py:40 - Queue result: {'componentStatus': [], 'reports': [{'actionId': u'4-1', 'clusterName': u'myapp1', 'exitcode': 1, 'reportResult': True, 'role': u'MYAPP', 'roleCommand': u'START', 'serviceName': u'myapp1', 'status': 'FAILED', 'stderr': '2015-08-19 14:07:28,268 - Error while executing command ..., 'stdout': '2015-08-19 14:07:23,261 - Execute[\'/usr/java/latest/bin/java -Xmx256m -classpath ..., 'structuredOut': '{}', 'taskId': 4}]}
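Since slider-agent.log can be long, a small filter can pull out just the failed command reports. This is a rough sketch that assumes the "Queue result" line format shown above. The function name and the returned dictionary keys are hypothetical, and the regular expressions only cover the fields visible in that sample line.

```python
import re

# Patterns match the repr-style fields seen in the sample log line,
# e.g. 'role': u'MYAPP' and 'exitcode': 1
ROLE = re.compile(r"'role': u?'([^']*)'")
ROLE_COMMAND = re.compile(r"'roleCommand': u?'([^']*)'")
EXITCODE = re.compile(r"'exitcode': (-?\d+)")


def failed_reports(lines):
    """Yield a summary dict for each 'Queue result' line reporting FAILED."""
    for line in lines:
        if "Queue result:" not in line or "'status': 'FAILED'" not in line:
            continue
        role = ROLE.search(line)
        command = ROLE_COMMAND.search(line)
        exitcode = EXITCODE.search(line)
        yield {
            "role": role.group(1) if role else None,
            "command": command.group(1) if command else None,
            "exitcode": int(exitcode.group(1)) if exitcode else None,
        }


if __name__ == "__main__":
    import sys

    # Usage: python failed_reports.py slider-agent.log
    with open(sys.argv[1]) as handle:
        for report in failed_reports(handle):
            print(report)
```

The stderr and stdout fields are deliberately left out of the summary, because their embedded quotes make them awkward to extract with a simple pattern; once a failing line is found, it is easier to read those fields directly in the log.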