Tags: apache-spark, hadoop, amazon-emr

Unhealthy EMR nodes "local-dirs are bad: /mnt/yarn,/mnt3/yarn"


I have a Spark EMR cluster with 1 master and 8 spot nodes. Today all the nodes died while running a job, and spark-shell was also not accessible afterwards.

Clicking 'Unhealthy Nodes' in the Hadoop console shows the errors: 2/4 local-dirs are bad: /mnt/yarn,/mnt3/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers

It seems related to the disk space issue in "Why does Hadoop report 'Unhealthy Node local-dirs and log-dirs are bad'?", so I modified yarn-site.xml as described there:

<property>
   <name>yarn.nodemanager.disk-health-checker.enable</name>
   <value>false</value>
</property>
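As an aside, disabling the health checker only masks the problem: full disks will still break containers. A less drastic alternative (a sketch, not what the question author did) is to raise the utilization threshold at which YARN marks a local-dir bad, via the standard NodeManager property:

```xml
<!-- Sketch: keep the disk health checker on, but only mark a dir bad
     above 98.5% utilization instead of the 90% default -->
<property>
   <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
   <value>98.5</value>
</property>
```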

and restarted the related services as described in "How to restart Spark service in EMR after changing conf settings?". But the nodes did not come back alive.

sudo stop hadoop-yarn-resourcemanager  
sudo start hadoop-yarn-resourcemanager 

sudo stop spark-history-server  
sudo start spark-history-server  

sudo status hadoop-yarn-resourcemanager
sudo status spark-history-server
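Since the "local-dirs are bad" message usually means the disks crossed YARN's utilization threshold (90% by default), it is worth confirming how full those directories actually are on each affected node. A minimal sketch; the paths in the comment are the ones from the error message, and the demo line just checks the root filesystem:

```shell
#!/bin/sh
# check_dir: print how full (in %) the filesystem backing a directory is.
check_dir() {
  # df's pcent column is the "Use%" figure; strip everything but digits
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# On an affected node you would check the dirs from the error, e.g.:
#   check_dir /mnt/yarn ; check_dir /mnt3/yarn ; check_dir /var/log/hadoop-yarn
# Demo on the root filesystem, which always exists:
echo "/ is $(check_dir /)% full"
```

Any directory reporting at or above the threshold explains why the NodeManager flagged it as bad.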

[Screenshot: AWS Console]

[Screenshot: Hadoop Console]

[Screenshot: From dead node]


Solution

  • Do you have termination protection on? If it's on, the unhealthy nodes cannot be automatically terminated and replaced - see https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_TerminationProtection.html
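If termination protection does turn out to be on, it can be switched off with the AWS CLI so EMR can replace the unhealthy nodes. A sketch; the cluster id below is a placeholder you would swap for your own:

```shell
# Turn off termination protection for the cluster
# (j-XXXXXXXXXXXXX is a placeholder cluster id)
aws emr set-termination-protection \
    --cluster-ids j-XXXXXXXXXXXXX \
    --no-termination-protected
```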