Hortonworks Nodemanager starts but then fails: Connection refused to :8042


I'm trying to solve an issue with a newly added datanode on our Hortonworks cluster. The YARN NodeManager on that node fails shortly after starting. The following error is reported:

Connection failed to http://(ipaddress):8042/ws/v1/node/info (Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 166, in execute
    connection_timeout=curl_connection_timeout, kinit_timer_ms = kinit_timer_ms)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 198, in curl_krb_request
    _, curl_stdout, curl_stderr = get_user_call_output(curl_command, user=user, env=kerberos_env)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
    raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl --location-trusted -k --negotiate -u : -b /var/lib/ambari-agent/tmp/cookies/4268dd36-9f72-4be0-8d82-5f0a124a3a72 -c /var/lib/ambari-agent/tmp/cookies/4268dd36-9f72-4be0-8d82-5f0a124a3a72 http://gdcdrwhdb821.dir.ucb-group.com:8042/ws/v1/node/info --connect-timeout 5 --max-time 7 1>/tmp/tmp7pZrbM 2>/tmp/tmpgM4wdg' returned 7.   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to (ipaddress):8042; Connection refused
)

This doesn't really tell me WHY the connection was refused, though, only that whatever YARN process should be listening on port 8042 isn't running:

netstat -tulpn | grep 8042

I've been looking for another NodeManager log with more information, but cannot find anything useful under /var/log/hadoop-yarn or in the yarn.nodemanager.local-dirs / yarn.nodemanager.log-dirs directories.

Are there other places I can look for YARN NodeManager error logs? Does anyone know what could be causing this?

Edit: After re-checking, I found this useful bit in /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-(ipaddress).log:

2017-04-19 14:01:14,670 FATAL nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(549)) - Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService
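
For anyone hunting for the same kind of entry, here is a minimal sketch of how the FATAL line above could be located without reading the logs by hand. It assumes the log files use the timestamp and level layout shown above and that their names contain "nodemanager"; the function names find_fatal_entries and scan_log_dir are made up for this example and are not part of Hadoop or Ambari.

```python
import re
from pathlib import Path

# Matches the start of a log entry as it appears above, e.g.
# "2017-04-19 14:01:14,670 FATAL nodemanager.NodeManager (...) - ..."
LEVEL_LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+) ")


def find_fatal_entries(lines, levels=("FATAL",)):
    """Return matching log entries, each as a list of lines.

    Lines without a timestamp (exception messages, stack frames) are
    attached to the entry that precedes them.
    """
    entries = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = LEVEL_LINE.match(line)
        if match:
            # A new timestamped line ends the entry being collected.
            current = [line] if match.group(1) in levels else None
            if current is not None:
                entries.append(current)
        elif current is not None:
            current.append(line)
    return entries


def scan_log_dir(log_dir, pattern="*nodemanager*.log"):
    """Print every FATAL entry found in matching files under log_dir."""
    for path in sorted(Path(log_dir).rglob(pattern)):
        with path.open(errors="replace") as handle:
            for entry in find_fatal_entries(handle):
                print(path)
                print("\n".join(entry))
```

Calling scan_log_dir("/var/log/hadoop-yarn") searches the subdirectories as well, which matters here because the relevant file sat one level down, under /var/log/hadoop-yarn/yarn.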

Solution

  • Not sure if this still helps; you have probably solved it already.

    You are using the external Spark shuffle service, which runs as an auxiliary service inside the NodeManager. The NodeManager currently cannot find the shuffle service jar on its classpath, hence the ClassNotFoundException for org.apache.spark.network.yarn.YarnShuffleService.

    Add the location of the shuffle service jar to yarn.application.classpath in yarn-site.xml.
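
To find out which jar to point the classpath at, a small sketch like the following may help. It walks the given directories and reports every jar that contains the missing class. The helper name jars_containing is hypothetical, and which directories to search depends on your installation; nothing here is specific to Hortonworks.

```python
import zipfile
from pathlib import Path

# The class named in the ClassNotFoundException above.
SHUFFLE_CLASS = "org.apache.spark.network.yarn.YarnShuffleService"


def jars_containing(class_name, search_dirs):
    """Return the jars under search_dirs that contain class_name."""
    # Inside a jar, a class is stored as a path ending in ".class".
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for directory in search_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue  # skip directories that do not exist on this node
        for jar in sorted(root.rglob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as archive:
                    if entry in archive.namelist():
                        hits.append(jar)
            except (zipfile.BadZipFile, OSError):
                # Unreadable or corrupt jar (or a dangling symlink): ignore it.
                continue
    return hits
```

For example, jars_containing(SHUFFLE_CLASS, ["/usr/hdp"]) would search an installation rooted at /usr/hdp, which is an assumed location and may differ on your nodes. An empty result on the failing node, compared with a hit on a working node, would suggest the jar was never deployed to the new machine.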