Tags: apache-spark, google-cloud-platform, google-cloud-dataproc

DataProc Cluster Spark Job submission fails to start NodeManager


We have a Dataproc cluster configured with 4 workers. The cluster is up and running, but whenever we try to submit a Spark job we get this error:

YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager

Some of the messages seen in the Stackdriver logs are:

Daemon YARN_NODE_MANAGER failed to restart

Update: The same issue occurs when we add a new worker node to the existing Dataproc cluster.

org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from <MasterNode DNS> , Sending SHUTDOWN signal to the NodeManager.
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:845)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:912)

Solution

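  • This error looks like a YARN NodeManager decommissioning problem: the ResourceManager rejects a node as "Disallowed" when it is listed in the exclude file or missing from a non-empty include file. Check whether there is a mistake in the following YARN include/exclude node configuration files on the Dataproc master GCE VM: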

    • /etc/hadoop/conf/nodes_exclude
    • /etc/hadoop/conf/nodes_include

    After changing these config files, run the refresh-nodes command:

    yarn rmadmin -refreshNodes 
    

    The NodeManager should then be able to register with the ResourceManager and rejoin the YARN cluster.

    For details, please refer to: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html#nodeslistmanager-detects-and-handles-include-and-exclude-list-changes
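
To make the check above concrete, here is a minimal sketch of how one might test a worker hostname against the two files on the master. It assumes both files are plain text with one hostname per line, and that an empty or missing include file means "allow all"; the function name `node_list_status` is hypothetical and not part of Hadoop or Dataproc.

```shell
#!/usr/bin/env bash
# Sketch only: report how a host relates to YARN include/exclude lists.
# Assumption: plain-text files, one hostname per line.

# node_list_status HOST INCLUDE_FILE EXCLUDE_FILE
# Prints "excluded", "not-included", or "allowed".
node_list_status() {
  local host="$1" include_file="$2" exclude_file="$3"

  # A host listed in the exclude file is rejected by the ResourceManager.
  if [ -f "$exclude_file" ] && grep -qxF "$host" "$exclude_file"; then
    echo "excluded"
    return
  fi

  # A non-empty include file is treated here as an allow-list.
  if [ -s "$include_file" ] && ! grep -qxF "$host" "$include_file"; then
    echo "not-included"
    return
  fi

  echo "allowed"
}
```

On the master, this could be called as `node_list_status <worker-hostname> /etc/hadoop/conf/nodes_include /etc/hadoop/conf/nodes_exclude`. If it prints anything other than `allowed`, fix the corresponding file, run `yarn rmadmin -refreshNodes`, and then check the node states with `yarn node -list -all`.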