Slave VMs are down in CloudLab


Two of my three slave VMs are down and I can't SSH into them. We performed a hard reboot, but they are still down. Any idea how to bring them back, or how to debug and find the cause? Here's what jps shows:

3542 RunJar
9920 SecondaryNameNode
10094 ResourceManager
10244 NodeManager
8677 DataNode
31634 Jps
8536 NameNode

Here's another detail:

ubuntu@anmol-vm1-new:~$ sudo netstat -atnp | grep 8020 
tcp        0      0 10.0.1.190:8020         0.0.0.0:*               LISTEN      8536/java       
tcp        0      0 10.0.1.190:50957        10.0.1.190:8020         ESTABLISHED 8677/java       
tcp        0      0 10.0.1.190:8020         10.0.1.190:50957        ESTABLISHED 8536/java       
tcp        0      0 10.0.1.190:8020         10.0.1.193:46627        ESTABLISHED 8536/java       
tcp        0      0 10.0.1.190:44300        10.0.1.190:8020         TIME_WAIT   -               
tcp        0      0 10.0.1.190:8020         10.0.1.190:44328        ESTABLISHED 8536/java       
tcp        0      0 10.0.1.190:8020         10.0.1.193:44610        ESTABLISHED 8536/java       
tcp6       0      0 10.0.1.190:44292        10.0.1.190:8020         TIME_WAIT   -               
tcp6       0      0 10.0.1.190:44328        10.0.1.190:8020         ESTABLISHED 10244/java      
tcp6       0      0 10.0.1.190:44252        10.0.1.190:8020         TIME_WAIT   -               
tcp6       0      0 10.0.1.190:44247        10.0.1.190:8020         TIME_WAIT   -               
tcp6       0      0 10.0.1.190:44287        10.0.1.190:8020         TIME_WAIT   -               
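
For reference, the standard way to ask the NameNode which DataNodes it currently considers alive is hdfs dfsadmin -report (a stock Hadoop command); a slave whose DataNode has stopped heartbeating eventually moves to the dead list:

hdfs dfsadmin -report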

When I run the following command:

hadoop fsck /

the result is:

The filesystem under path '/' is CORRUPT

Here are more details in this pastebin.
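
For reference, hadoop fsck takes standard flags that show exactly which files and blocks are affected:

hadoop fsck / -list-corruptfileblocks
hadoop fsck / -files -blocks -locations

The -move (quarantine to /lost+found) and -delete flags also exist, but they permanently give up on the corrupt files, so they should not be used while the missing replicas may simply be on DataNodes that are offline.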


Solution

  • If they are down and you cannot SSH into them, your filesystem may be full. SSH will not work in that state, so you have to log in through the VM console and clean up the filesystem; a sketch of typical cleanup commands follows below.
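
A minimal sketch of that cleanup, assuming a typical Linux guest (the paths below are illustrative, not taken from the post):

df -h                      # which filesystem is at 100%?
df -i                      # a full inode table can also block logins
sudo du -xh -d1 / 2>/dev/null | sort -rh | head -20    # biggest top-level directories
du -sh $HADOOP_HOME/logs   # Hadoop daemon logs are a common culprit

Once enough space is freed, sshd usually starts accepting connections again without a reboot.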