What are all the reasons why a node in a cluster goes into an unhealthy state?
Based on my limited understanding, this generally happens when the HDFS utilization on the given node goes beyond a threshold value. The threshold is defined with the max-disk-utilization-per-disk-percentage property.
I have observed that at times, when a memory-intensive Spark job is triggered through spark-sql or pyspark, nodes go into the unhealthy state. Looking further, I SSHed into the node that was unhealthy and discovered that its DFS utilization was actually less than 75%, while the value set for the above-mentioned property on my cluster was 99.
So I presume there is some other factor I am missing that causes this behavior.
Thanks in advance for your help.
Manish Mehra
The YARN NodeManager on each Hadoop worker node marks the node unhealthy based on heuristics determined by its health checker. By default this is the disk checker; if configured, an external health-checker script can also be used.
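If you do want an external check, the NodeManager runs the script configured in yarn.nodemanager.health-checker.script.path and treats any stdout line beginning with "ERROR" as a signal to mark the node unhealthy. A minimal sketch (the check inside it is just an illustrative assumption, not something from your setup):

```sh
#!/bin/bash
# Illustrative NodeManager health-check script; point
# yarn.nodemanager.health-checker.script.path in yarn-site.xml at this file.
# Any stdout line beginning with "ERROR" marks the node unhealthy.

# Hypothetical check: flag the node if the root volume is over 95% full.
usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$usage" -gt 95 ]; then
  echo "ERROR: root volume is ${usage}% full"
fi
```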
The default disk checker checks the free disk space on the node's local disks and, if the disk(s) go over 90% utilization, marks the node unhealthy. (90 is the default and is set with yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage.)
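For reference, this is roughly how those knobs look in yarn-site.xml; the values below are only illustrative (your cluster apparently has the percentage set to 99), and min-free-space-per-disk-mb is a related property I'm mentioning as an extra option:

```xml
<!-- Mark a local disk as bad once it is more than this percent full (default 90.0). -->
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>

<!-- Also mark a disk as bad when it has less than this much free space, in MB. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb</name>
  <value>1024</value>
</property>
```

When too few local disks remain healthy (the fraction is controlled by yarn.nodemanager.disk-health-checker.min-healthy-disks), the whole node is reported unhealthy.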
In your case, you seem to be checking HDFS usage, which spans across nodes. You need to verify the disk utilization on each individual node, for example with "df -h" on that node. If you see a volume like /mnt/ going over the configured threshold (99% on your cluster), the node will be marked unhealthy.
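For example (the node ID below is a placeholder; take real IDs from the -list output):

```sh
# List NodeManagers with their state (RUNNING, UNHEALTHY, ...)
yarn node -list -all

# Show the health report for one node, including which local dirs are flagged as bad
yarn node -status ip-10-0-0-12.ec2.internal:8041

# Then ssh to that node and check the usage of each mount point
df -h
```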
You will need to find the top directories occupying the most disk space and take appropriate action. HDFS, which uses the local disk(s) on the nodes (the directories set with dfs.data.dir / dfs.datanode.data.dir), can make a node unhealthy if its utilization is very high during a job run. However, nodes can also go unhealthy without high HDFS utilization.
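To separate HDFS usage from everything else on a node, something like this usually narrows it down (the /mnt path is a placeholder for whatever volume df showed as full):

```sh
# HDFS usage and remaining space per DataNode, as reported by the NameNode
hdfs dfsadmin -report

# Largest directories on the full volume (-x stays on that one filesystem)
sudo du -x -h --max-depth=2 /mnt 2>/dev/null | sort -h | tail -n 20
```

In the Spark scenario you describe, the space is commonly consumed not by HDFS blocks but by shuffle spill and application caches under yarn.nodemanager.local-dirs and by container logs under yarn.nodemanager.log-dirs, which sit on the same volumes the disk checker watches.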