I'm trying to solve an issue with a newly added datanode on our Hortonworks cluster. The YARN NodeManager on that node fails shortly after starting. The following error is returned:
Connection failed to http://(ipaddress):8042/ws/v1/node/info (Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 166, in execute
connection_timeout=curl_connection_timeout, kinit_timer_ms = kinit_timer_ms)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 198, in curl_krb_request
_, curl_stdout, curl_stderr = get_user_call_output(curl_command, user=user, env=kerberos_env)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl --location-trusted -k --negotiate -u : -b /var/lib/ambari-agent/tmp/cookies/4268dd36-9f72-4be0-8d82-5f0a124a3a72 -c /var/lib/ambari-agent/tmp/cookies/4268dd36-9f72-4be0-8d82-5f0a124a3a72 http://gdcdrwhdb821.dir.ucb-group.com:8042/ws/v1/node/info --connect-timeout 5 --max-time 7 1>/tmp/tmp7pZrbM 2>/tmp/tmpgM4wdg' returned 7. % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed connect to (ipaddress):8042; Connection refused
)
This doesn't really tell me WHY the connection was refused, only that whatever YARN process should be listening on port 8042 isn't running:
netstat -tulpn | grep 8042
I've been looking for another NodeManager log that might have more information, but I cannot find anything useful under /var/log/hadoop-yarn or in the directories set by yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs.
Are there other places I can look for yarn nodemanager error logs? Does anyone know what could be causing this?
Edit: After re-checking, I found this useful bit in /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-(ipaddress).log:
2017-04-19 14:01:14,670 FATAL nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(549)) - Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
Not sure if this still helps; you may have already solved it.
You are using the external shuffle service, which runs as an auxiliary service inside the NodeManager. The NodeManager currently cannot find the shuffle service jar on its classpath.
Please add the location of the shuffle service jar to yarn.application.classpath in yarn-site.xml.
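If you are not sure where the shuffle service jar lives on the node, something like the following may help locate it. This is only a minimal sketch: the function name find_jars_with_class is made up for illustration, and the directories to scan are whatever you pass on the command line (no install paths are assumed). It just opens each jar as a zip archive and checks for the class named in the ClassNotFoundException.

```python
import glob
import os
import sys
import zipfile

# Class name taken from the ClassNotFoundException in the NodeManager log.
SHUFFLE_CLASS = "org.apache.spark.network.yarn.YarnShuffleService"


def find_jars_with_class(search_dirs, class_name=SHUFFLE_CLASS):
    """Return sorted paths of jars under search_dirs that contain class_name.

    Hypothetical helper for illustration; not part of Hadoop, Spark or Ambari.
    """
    # Inside a jar, a class is stored as a path ending in ".class".
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for directory in search_dirs:
        pattern = os.path.join(directory, "**", "*.jar")
        for jar_path in glob.glob(pattern, recursive=True):
            try:
                with zipfile.ZipFile(jar_path) as jar:
                    if entry in jar.namelist():
                        matches.append(jar_path)
            except (zipfile.BadZipFile, OSError):
                # Skip unreadable or corrupt archives instead of aborting.
                continue
    return sorted(matches)


if __name__ == "__main__":
    # Directories to scan come from the command line; none are assumed.
    for path in find_jars_with_class(sys.argv[1:]):
        print(path)
```

Run it with the directories you suspect as arguments. If it prints nothing, the jar is probably not installed on the new node at all. If it prints a path, that is the location to add to the classpath setting above.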